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Chapter 1 



Introduction 



1.1 Overview 



In 1837, Darwin published a first sketch of an evolutionary tree, see Fig. II. 1[ This 
new idea that all species evolved over time was under a lot of discussion and not until 
the early 20th century was evolution generally accepted by the scientific community. 
Since then, much research went into the field of evolution. With the help of fossils, and 
by comparing the anatomy as well as the geographic occurrence of species, complex 
evolutionary trees have been created. 

In an evolutionary tree, each leaf represents an existing species and all the interior 
vertices represent the ancestors. The edges of the tree show the relationships between 
the species. 

The first step to modern evolutionary research was the discovery of the double 



Figure 1.1: Darwin's first diagram of an evolutionary tree from his 'First Notebook 
on Transmutation of Species' (1837). 
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Figure 1.2: The DNA - a double helix 



helix structure of DNA (deoxyribonucleic acid) by Watson and Crick in 1953. 
The genetic code is a long chain of bases (Adenine, Cytosine, Guanine, Thymine) 
and triplets of these bases encode the 20 amino acids. A backbone of sugars and 
phosphates holds the bases together, see Fig. 11.21 The amino acids in a cell form 
proteins according to the DNA code. From a chemical point of view, life is nothing else 
than the functioning of proteins. Since the DNA determines which proteins are built, 
a living organism can chemically be described by its DNA, the genetic information 

Each cell of an organism has an identical copy of the DNA. In eukaryotes, the 
DNA is found in a cell nucleus whereas in prokaryotes (archaea and bacteria), the 
DNA is not separated from the rest of the cell. 

During reproduction, the DNA is transmitted to the offspring, so parents and 
children are similar in many ways (e.g. hair color, blood group, disease susceptibility). 

It was not until 2003 that the complete human DNA code was described. 
Currently, the complete DNA sequence of several different species is known (358 
bacteria, 27 archae, 95 eukaryotes, see http://www.ncbi.nih.gov/). By aligning the 
DNA of different species, the similarities and differences of the DNA allow us to 
reconstruct lineages with more accuracy than before; for an example see Fig. 11.31 

It is noticeable that the same four DNA bases and the 20 amino acids are found 
in all organisms. This is strong evidence for having one common ancestor to all the 
species. 
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Figure 1.3: Illustration of the tree of life by Carl R. Woese. There 
are three main branches, the bacteria, archaea and eucarya, source 
http: / / www.life.uiuc.edu / micro / faculty / faculty- woese.htm. 



Evolutionary trees are also called 'phylogenetic trees'. If all the species in the tree 
have a common ancestor, we call the tree a 'rooted tree', the common ancestor is 
called the 'root'. 

I take a closer look at rooted phylogenetic trees. The shape of the tree is 
determined by how speciation occurred. But since speciation is not understood well 
and is dependent on historical events which we might never be able to reconstruct, 
a stochastic model for speciation is needed. I investigate the Yule model and the 
uniform model, two very common models. 

In my thesis, I develop the theory with a view to the following applications in 
biology. 

Rutger Vos and Arne Mooers from the Simon Fraser University (Vancouver) 
recently constructed a supertree for the primates (i.e. lemurs, monkeys, apes and 
humans) as shown in Appendix O 

In Section 12.2.11 we will see that the primate tree is much more likely to have 
evolved under the Yule than under the uniform model. 

With the supertree method, the shape of the primate tree could be determined, 
but there was no information about the edge lengths, i.e. the time between speciation 
events. In [12], edge lengths were estimated by simulations, assuming the (super)tree 
evolved under the Yule model. The authors concluded by asking for an analytical 
approach which I develop in Chapter |3] 

Craig Moritz (UC Berkeley) and Andrew Hugall (University of Adelaide) worked 
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with an evolutionary tree which had edge lengths assigned. The leaves were different 
types of snails. The snails either live in open forest or rain forest. Moritz and Hugall 
asked (pers. comm.) if the rate of speciation for open forest snails differs from the 
rate of speciation for rain forest snails. The rate of speciation is a measure of how 
fast a class of species produces splits in the evolutionary tree. Chapter provides a 
linear algorithm for solving that problem. 

1.2 Short guide to the thesis 

In Chapter |21 two important stochastic models for binary phylogenetic trees are 
introduced - the uniform and the Yule model. Those two models are discussed and the 
Kullbach-Liebler-distance between them is calculated. The KuUbach-Liebler-distance 
turns out to be very useful in deciding whether a given tree evolved under the Yule 
or the uniform model. 

Chapter 121 formulates a test statistic for that decision problem, the log-likelihood- 
ratio test. Instead of estimating the power of the test by simulations, we provide 
an analytic bound for the power by introducing a martingale process on trees and 
applying the Azuma inequality. 

The algorithms in Chapter ^ work in particular for trees under the Yule model. 
In order to verify that a tree evolved under Yule, the test provided in Chapter 01 can 
be applied before running the algorithms. 

After having established all the necessary stochastic background. Chapter HI 
provides a quadratic algorithm for calculating the probability distribution of the rank 
for a given interior vertex in a phylogenetic tree. The algorithm is called RankProb 
and we assume that every rank function on a given tree is equally likely. That is in 
particular the case for the Yule model. The algorithm RankProb is extended to 
non-binary trees as well, again we assume that every rank function is equally likely. 
We call that algorithm RankProbGen. Calculating the probability of having an 
interior vertex u earlier in the tree than an interior vertex v is calculated with the 
algorithm COMPARE in quadratic time. We coded up the algorithms RankProb 
and Compare in Python, see Appendix^ The chapter concludes with an analytical 
approach of estimating edge lengths in a given tree under the Yule model. This 
approach makes use of the algorithm RankProb. 

Chapter El looks at the rate of speciation. Given is a phylogenetic tree with the 
leaves being divided into two classes a and f3. The edge lengths shall represent the 
time between two events. We provide a linear algorithm for the expected time a 
species of class a exists until it speciates and two new species evolve. The average 
edge length is an estimate for the inverse of the rate of speciation. An example for 
the classes a and f3 could be rain forest snails and open forest snails. 

After introducing the stochastic models in Chapter |21 the remaining results 
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Figure 1.4: A rooted binary tree 



in that Chapter are new. The results in Chapter |21 ID and El are new unless 
otherwise stated. Improvements on the algorithms in Chapter 0] and coding them 
up in Python was joint work with Daniel Ford. Chapter |3] was the topic of my 
talk at the New Zealand Phylogenetics Conference in Kaikoura in February 2006 
(http: / /www. math. canterbury.ac.nz/bio/kaikoura06/). 

The rest of this Chapter introduces the basic definitions from graph theory and 
phylogenetics needed for the thesis. Further, some basic results for phylogenetic trees 
are stated. 



1.3 Graphs and Trees 

Definition 1.3.1. A graph G is an ordered pair (V, E) consisting of a non-empty set 

V of vertices and a multiset E of edges each of which is an element of {{x, y} : x,y & 
V}. The degree S{v) of a vertex f G is the number of edges in G that are incident 
with V. A path p in G from vertex x & V to vertex y & V is a sequence p = (fi)i=i,...n, 
Vi G V, such that x = Vi, y = Vn, and {vi, fj+i} G -E for z = 1, . . . — 1. A graph G 
is connected precisely if there exists a path from x to y for all x,y & V. A cycle in a 
graph is a path p = (fi)j=i,...n with vi = f„. The graph G' = (V', E') is a subgraph 
oiGiiV (ZV and E' C E. 

Definition 1.3.2. A tree T = {V,E) is a connected graph with no cycles. A 
connected subgraph of T is a subtree of T. A rooted tree is a tree that has exactly 
one distinguished vertex called the root which we denote by the letter p. A vertex 

V & V with 6{v) < 1 is called a leaf . The set of all leaves of T is denoted by L. 
A vertex which is not a leaf is called an interior vertex. Let V denote the set of 
all interior vertices of T. A binary tree is a tree with 6{v) = 3 for all f G V^. A 
rooted binary tree is a rooted tree with 6{v) = 3 for all w G \ p and 6{p) = 2. 
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abcdefghij k f hi j 

Figure 1.5: A rooted binary phylogenetic X-tree T with X = {a, b, . . . ,k} and the 
subtree T' = T\{f^h,i,j}- 



Let V C V. The subtree T' = T\v' is the minimal (w.r.t. the number of vertices) 
connected subgraph of T containing V. An edge which is incident with a leaf is called 
a pendant edge. A non-pendant edge is called an interior edge. Two distinct leaves 
of a tree form a cherry if they are adjacent to a common ancestor. Let v & V\p with 
6{v) = 2. The vertex v is suppressed in T if we delete v with its two incident edges 
ei = {vi,v),e2 = (f,f2) and then add a new edge e = (wi,f2)- For an example of a 
tree see Fig. 11.41 

Definition 1.3.3. Let T = {V,E) be a rooted tree with leaf set L C \^ and for all 

V & V\p is 6{v) 7^ 2. Let X be a non-empty finite set with |X| = \L\. Let : X — > L 
be a bijection. Then T = (T, 0) is called a phylogenetic (X—) tree with labeling 
function 0. X is called the label set. A phylogenetic tree is also called a labeled tree. 
A tree shape is a phylogenetic tree without the labeling. 

Remark 1.3.4. In the following, for a phylogenetic tree T, we sometimes write Ej- 
instead of E, Vr instead of V, Vr instead of V and Lr instead of L. This notation 
clarifies to which tree the sets refer whenever we talk about several different trees. 

Definition 1.3.5. Let T be a rooted tree. A partial order <t on V is obtained by 
setting vi <t f2 {vi,V2 G V) precisely if the path from the root p to V2 includes Vi. If 
Vi <T V2, we say V2 is a descendant of Vi and Vi is an ancestor of V2. If fi <t "^2 and 
there is no V3 & V with vi <t fs <t ^2, we say V2 is a direct descendant of Vi and 
vi is a direct ancestor of f2. The number of direct descendants of v is ). When we 
talk about a phylogenetic tree, we often write <r instead of <t. 

Definition 1.3.6. Let T = (T, 0) be a phylogenetic X-tree. Let X' C X. The 
phylogenetic subtree T' = T\x' = {T', 0') is a phylogenetic tree where T' is the tree 
^U(x') with all degree-two vertices suppressed (except for the root). The labeling 
function is 0' = 0|x'- The root of T' is the vertex p' which is minimal in the tree 
T' under the partial order <r (see Fig. II. 5|) . Let T' be a subtree of T. Denote the 
subtree T\l^\l^, by T \ T'. 
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abcdefghi j k 
Figure 1.6: A rooted binary phylogenetic ranked X-tree with X = {a, b, . . . ,k} 



Let V & V and let be the label set of all the leaves in T which are descendants 
of V. The subtree Ty is induced by v if % = T\x^- A binary phylogenetic tree is 
balanced if the two subtrees induced by the two direct descendants of the root have 
the same shape. Otherwise, the tree is unbalanced. 

Definition 1.3.7. Let T be a rooted phylogenetic tree. Let the function r be a 
bijection from the set of interior vertices V of T into {1, 2, . . . , |V^|} that satisfies the 
following property: 

if vi <r V2, then r(t>i) < r(t>2) 

{T,r) is called a phylogenetic ranked tree (see Fig. Il.fi|l . The function r is called a 
rank function for T. A vertex v with r{v) = i is said to be in the i — th position of 
T or f has rank i. We write vq- instead of r when it is not clear from the context to 
which tree the rank function r refers. Note that r induces a linear order on the set 
V. We define the set r(T) as 

r(T) = {r : r is a rank function on T}. 



The following Lemma has been shown in [Mj using poset theory. We will give an 
elementary proof using induction. 

Lemma 1.3.8. Let T he a rooted phylogenetic tree. For each v & V , let denote 
the number of elements of V that are descendants of v. Then the number of rank 
functions for T is 

\V\\ 



Note that a vertex v is a descendant of itself by definition, so also counts the vertex 

V. 



Proof. This proof is done by induction over the number n of interior vertices of a tree. 
For n = 1, there is only one rank function, the only interior vertex has rank 1, which 
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equals to ^ ^ = y = 1. Suppose that (jl-H) is true for all trees with n < k interior 

vertices. Let T be a tree with k interior vertices. The degree of root p is 6{p) = m 
where m < k. T has m vertex-disjoint rooted subtrees 7i, 72, . . . , 7^ induced by the 

direct descendants of n, and with |Vr | < k. Each subtree % has i-r , different 

rank functions by the induction assumption. Counting all the rank functions on T is 
equivalent to counting the rank functions on each subtree % and then combining the 
positions of the vertices of all the % to get a linear order on VV, by preserving the 
order of the vertices of each %. For a given rank function on each we can order 

all the interior vertices in /, . different ways where the order within each % is 

preserved. Multiplying by all the possible rank functions for each % yields to 



\<iT)\ 




(lVVl-1)! 



n 

|VV|! 

This establishes the induction step, and thereby the theorem. □ 

Remark 1.3.9. In the following, all trees shall be rooted. The set of all binary rooted 
phylogenetic trees with label set X is denoted by RB{X). The set of all ranked binary 
rooted phylogenetic trees with label set X is denoted by rRB{X). 

Remark 1.3.10. A rooted binary phylogenetic tree with n leaves has \V\ = n — 1 
interior vertices and \E\ = 2{n — 1) edges, which is shown by induction in 



Chapter 2 

Stochastic Models on Trees 



Given a phylogenetic X-tree, we are interested in the probability of that tree from 
the set RB{X) or rRB{X), depending on whether the given tree is ranked or not. 
When defining a probabihty distribution on trees, the probabihty of a labeled tree 
should be invariant under a different labeling. This property is called exchangeability. 

There are several stochastic models for binary phylogenetic X-trees, the most 
common are the uniform and Yule model which we will introduce and compare. 

In the following, for simplifying notation, any X with |X| = n shall be X = 
{1, 2, . . . , n} and we write RB{n), rRB{n) instead of RB{X), rRB{X). 

2.1 The uniform model 

Under the uniform model, a random element of RB{n) is generated in the following 
way (c/. Figure ITTj) : 

• Label the two leaves of a cherry with 1 and 2. 

• Add to the cherry a third edge connecting the root p of the cherry and a new 
vertex p' which is earlier than p. This extended cherry is denoted by T. 

• In each step, modify T in the following way, until T has n leaves: 

— Let the number of leaves of T be k. Choose an edge of T randomly and 
with uniform probability and subdivide this edge to create a new vertex. 

— Add an edge from the new vertex to a new leaf. 

— Label the new leaf by k + 1. 

• Remove from the tree T the vertex p' and its incident edge to get the binary 
rooted tree T . 



9 
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P' 




142 3142 31 2431 2431 2 34 



Figure 2.1: Tree evolving under the uniform model. Let X = {1,2,3,4}. Given the 
tree T' with label set {1,2,3}, which has probability 1/3 under the uniform model, 
there are five possible edges to attach the leaf with label 4. Each of the five trees with 
label set {1,2,3,4} has probability 1/5 given T'. So the overall probability of each 
tree with four leaves is 1/15 under the uniform model. 



In this way, each rooted binary phylogenetic X-tree has equal probability (see [llj). 
Obviously, the probability of a tree is invariant under a different leaf labeling. Note 
that it is not necessary to choose the elements of X in the given order 1,2, ... ,n. 
We could choose the leaf labels in any order. This will not be the case for the Yule 
model. 

Lemma 2.1.1. For each n >2, 

n\cn-i 



(2n-3)!! 



)n-l 



with {2n — 3)!! = {2n — 3) • (2n — 5) ... 5 ■ 3 ■ 1 and Cn being the n-th Catalan number, 
Proof. 



^" n+l \n) 



{2n-3)\ 



(2n-3)! {2n-3)\ 



2"-2(2!^)! 2"-2(r2-2)! 

(n-l)!(2(«-l))! ,i/2(n-lh 

{^n - 2)1 ^ 2(n-l)! ^ n-n \ n-1 ) 

2"-i(n-l)! 2"-i 2"-i 

n\Cn-l 
On-1 



□ 
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The following result is already shown in [14j by considering unrooted trees and 
defining a bijection from unrooted to rooted trees. This proof is direct. 

Theorem 2.1.2. The number of binary rooted phylogenetic trees is 

\RB{n)\ = (2n-3)!! 



Proof. The proof is done by induction over n. For n = 2, we have |-R-B(2)| = 1 and 
(2 ■ 2 - 3)!! = 1. Assume \RB{n)\ = (2n - 3)!! holds for all n < k, where k > 2. A 
tree Tk with k leaves has 2{k — 1) edges (see Remark ()1.3.10|) ). Denote the root of 7^ 
by pk- The {k + l)-th leaf x can be attached to Tk to any of the 2{k — 1) edges or 
a new root p with edges ei = {p, Pk) and 62 = {p,x) is added. So we can construct 
2{k — 1) + 1 = 2A; — 1 different trees from 7^. By the induction assumption, we have 
\RB{k)\ = {2k-3)\\. Therefore, \RB{k + l)\ = {2k - ■ {2k - 1) = (2(fc + 1) - 3)!! 
which proves the theorem. □ 

Corollary 2.1.3. Under the uniform model, the probability P[T] of a tree T chosen 
from the set RB{n) is 

1 2"~^ 

p[ri = — = — = - — . 

^ ^ (2n-3)!! n!c„_i 

Proof. Since a phylogenetic tree T is chosen from RB{n) uniformly at random in the 
uniform model, we have 

P[ri = - — 

\RB{n)\ 

By Theorem ^TT^ and Lemma (jmil . we get r[T] = ^^^^3^,, = □ 

2.2 The Yule model 

Under the Yule model |H], a random element of rRB{n) is generated in the 
following way (c/. Figure : 

• Two elements of X are selected uniformly at random and the two leaves of a 
cherry are labeled by them. This cherry is denoted by T and its root has rank 
1. 

• In each step, modify T in the following way, until T has n leaves: 

— Let the number of leaves of T be k. Choose a pendant edge of T uniformly 
at random and subdivide this edge to create a new interior vertex with 
rank k. 
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Figure 2.2: Ranked tree evolving under the Yule model. Let X = {1, 2, 3, 4}. Suppose 
the ranked tree T' with label set {1,2,4} evolved under the Yule model. There are 
three possible pendant edges to attach the leaf with the remaining label 3. Each 
ranked tree with label set {1,2,3,4} has probability ^|^^_-^^, = 1/18 according to 
Theorem 



— Add an edge from the new vertex to a new leaf. 

— Select an element of X which is not in the label set of T uniformly at 
random and label the new leaf by that element. 

In other words, any pendant edge of a binary tree is equally likely to split and give 
birth to two new pendant edges. The Yule model is therefore an explicit model of 
the process of speciation. This makes it a very important model for the distribution 
on trees. Since the labels are added uniformly at random, the probability of a tree 
is invariant under a different leaf labelling (i.e. dependent only on the 'shape' of the 
tree). 

Note that under the Yule model, at each moment in time, the probability of a 
speciation event is equal for all the current species. For different points in time, these 
probabilities can be quite different though. 

Under the Yule model, balanced trees are more likely than unbalanced trees 
whereas under the uniform model, every tree is equally likely. Phylogenetic trees 
constructed for most sets of species tend to be more balanced than predicted by the 
uniform model, but less balanced than predicted by the Yule model. That can be 
explained in the following way. In nature, we observe that a species, which has not 
given birth to new species for a long time, is not very likely to give birth in the 
future either. The Yule model does not take this fact into account. In there is an 
extension of the Yule model described which takes care of that biological observation. 
One special case of the extended Yule model assumes, that unless a species has 
undergone a speciation event within the last e time interval, it will never do so. It is 
shown in that for sufficient small e, this model induces the uniform distribution. 
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So the uniform model can also be interpreted as a process of speciation. 

The Yule and the uniform model can be put in a more general framework. In pp, 
the bet a- splitting model is introduced, where the Yule and the uniform model are 
special cases. In 0, the alpha model is introduced and again, the Yule and the uniform 
model are special cases. In both papers, a one parameter family of probability models 
on binary phylogenetic trees is introduced which interpolates continuously between 
the Yule and the uniform model. 

These models are far more complicated than the uniform and Yule model though, 
and since especially the Yule model is still a reasonably good model for speciation, 
we will now focus on properties of the Yule model. Theorem (j2.2.H) and Corollary 
(I2.2.2j) have been established in 0. Here we provide an alternative proof. 

Theorem 2.2.1. The probability under the Yule model of generating a ranked binary 
phylogenetic tree (T, r) G rRB{n) is 



That is a uniform distribution over rRB{n). 

Proof. We calculate the probability P[T, r] by looking at the generation of the tree 
T. In the first step of the generation, we have n possibilities to choose the label for 
the left leaf of the cherry and n — 1 possibilities to choose the label for the right leaf 
of the cherry. So the probability for a certain cherry, with distinguishing between left 
and right vertex, is , since the selection of the labels is uniformly at random. 

The root of the cherry has rank 1. When adding a new leaf to a tree with k leaves, 
we have k possibilities to choose a pendant vertex and n — k possibilities to choose 
a label. So the probability of attaching a new labeled leaf to a certain edge is 
since we choose the pendant edge and the label uniformly at random. The new interior 
vertex has rank k. Let the new leaf be x. The leaf x shall be on the right side of the 
new cherry. With the process above, we get two equal trees precisely if every step 
of the tree generation process is equal for both trees. While distinguishing between 
left and right child of an interior vertex, we count each phylogenetic tree 21^' = 2"~^ 
times. Therefore, we get the following probability for the ranked phylogenetic tree 



p[r , r] 



n\{n — 1)! 




Since P[T, r] is independent of T and r, we have a uniform distribution. 
Corollary 2.2.2. The number of ranked phylogenetic trees is 



□ 
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Proof. Since P[T, r] = is uniform under the Yule model and probabilities add 

up to 1, we have "g^Ii^''' different ranked phylogenetic trees. □ 



Lemma 2.2.3. Let A be a finite set and for each a & A, let B{a) be a finite set 
and let Q = {{a,b) : a G A, 6 G B{a)}. Let C = (Ci,C2) be the (two-dimensional) 
random variable which takes a value in Q selected uniformly at random, i.e. F[C = 
(a, 6)] = l/|fi| for all (a, 6) G fl. Then the conditional probability distribution P[C = 
{a,b)\Ci = a] is uniform on B{a). 

Proof. We have 

F[C = {a,b)] 1 



r[C = {a,b)\C, 



' P[Ci = a] MF[Ci = a] 

which is independent of b and therefore is uniform on B{a). □ 

Theorem 2.2.4. Assume a given binary phylogenetic tree T with n leaves evolved 
under the Yule model. Then the probability of a rank function r on a given tree T is 

i.e. P[r|T] is uniform over all rankings r of T . 

Proof. Consider the probability distribution induced by the Yule model on A = 
RB{n). Let B{a) be the set of all rankings for a tree a G A and let Vl = {{a,b) : 
a G A, 6 G B{a)}. Let C = (Ci,C2) be the (two-dimensional) random variable which 
takes a value in Q. The random variable C is uniform on the set Q by Theorem ()2.2.H) 
and we can apply Lemma ()2.2.3|) to obtain 

{T,r)\C, = T]=F[r\T] ^ 



MnCi = T] 



which shows that P[r|T] is uniform over all rankings r of T. Since for a tree T, we 
have j-^ ^ possible rankings by ()1.3.8|) . and |y| = n — 1 for binary trees, we get 

L I J |v-|! (n- 1)! 

□ 



The following Corollary was established in |lj using induction. 
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Corollary 2.2.5. The probability of a binary phylogenetic tree T G RB{n) under the 
Yule model is 

on— 1 

P[T] = 



where is as defined in Lemma M.3. t^) . 

Proof. With Theorem ()2.2.H) and Theorem ()2.2.4|) we get 

P[T,r] 2"-i {n-l)\ 2"-^ 



PiTl 



[r|r] n\{n-l)\ H^^y X. R.^y A/ 



□ 



Example 2.2.6. Recall again the ranked tree {T,r) in Fig. 11.61 In that tree, X = 
{a, b, . . . ,k} and n = \X\ = 11. Let Py [T, r] be the probability that the ranked tree 
(T, r) evolved under the Yule model. With Theorem ()2.2.1|) . we get 

on— 1 olO 

FyIT, r] = — = — ^ 0.71 X 10"^^ 

^ ' J n\{n-l)\ 111 X 10! 

With Corollary (j2.2.5jl . we get 

PyfTl = —= — = — ^ 0.21 X 10^^ 

n^-Uvev^v 111x15x2x3x4x5x10 

With Theorem (imi . we get 

r n„f=f/ A,; 1^X2x3x4x5x10 o 

^^t"'^] = F?T)! = Toi " ^ 

Let Pt/[T] be the probability that T evolved under the uniform model. Then, 

P^[r] = l/(2ra - 3)!! ^ 0.15 x 10"® 

Since ^ |§ x 10^ = 14 > 1, i.e. Py[T] > Pc/[T], the tree T (without a ranking) 
is more likely to have evolved under the Yule model. 

Remark 2.2.7. In Chapter |3J we want to calculate for a given phylogenetic tree T 
the probability P[r(f) = i,r & r{T)\T] for a G under the Yule model where 
r(T) as defined in ()1.3.7j) . By Theorem ()2.2.4j) . the rankings for T all have the same 
probability, and therefore 

P|rM^Vre r(T)|r] J{- ''I . 

|r(T)| 

For the value |r(T)|, a formula is stated in Lemma 11.3.81 The value \{r G r{T) : 
r{v) = i}\ will be calculated with the algorithm RankCount. 
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ABC A C B B C A 



Figure 2.3: Vertex in Tp with three direct descendants. There are three possible binary 
resolutions. 

Remark 2.2.8. Another stochastic model on trees is the coalescent model. The 
coalescent model starts with n species and goes back in time. At each event, two 
species are selected uniformly at random and the two species are joint together, the 
joint being a new species, the ancestor. So after n — 1 joining events, we are left with 
one species, the root of the tree. 

With i remaining species, we have (*) possibilities to choose two species for the 
joint. The probability for a specific ranked tree is therefore 

1 

which is equivalent to the Yule model. 

Thus, the Yule model and the coalescent model are equivalent as long as edge 
lengths are not considered. 



2.2.1 Did the primate tree evolve under Yule? 

Consider the primate tree Tp in Appendix O Tp has = 218 leaves. We want to 
calculate the value J" in order to decide whether to favor the Yule model over the 

uniform model. Note that Pt/fTl = and PyfTl = , . 

In Tp, there are six vertices (vertex labels 48, 63, 148, 153, 157 and 200) with more 
than two direct descendants because the exact resolution is unclear. Five of those 
vertices have three direct descendants. 

For each vertex with three direct descendants, there are three possible binary 
resolutions, see Fig. 12.31 
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Let u he a vertex of Tp with three direct descandants. Let v be the additional 
vertex for a binary resolution of vertex u. For the three different binary resolutions 
of vertex u, we also write vi, f 2, fs instead of v, see Fig. 12.31 

Let T' be a binary resolution of Tp. Let i = 1,2, 3, be a binary resolution of 
Tp where vertex u is resolved as displayed in Fig. 12.31 Let Xv{T') be the number of 
descendants of v in resolution T'. We want to estimate A„. 



Xy — 



r< 

i=\ r' 

2" 



i=l T.' 
3 



i=l T' 



n 



on 

EE 

i=l T' 



EE- 

,-1 Tt n\ 



i 

2" 



1 7:' -^w; 

E3^E 



n 

■i«e{v^/\fi} 
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X 




Figure 2.4: Vertex in Tp with four leaf-descendants. 



Note that the inner sum is constant for all i, so we get 



' 1 /t 

E^-pf — El 

' II i=l 



E 



^ 1 



n\ JJ ^' 

uie{y^/\i'i} 



1 

^ a7 

With this formula, we estimate the values A^ for the new vertex v in the binary 
resolution of vertex 48, 63, 153, 157 and 200. 

The interior vertex with label 148 has four leaves as direct descendants. There 
are two different shapes ti and t2 for a binary tree with four leaves, see Fig. 12.41 In 
tl, the new interior vertices Vi and Wi have the value A„^ = 1 and A^^ = 1. In the 
new vertex V2 has A„2 = 2, the new vertex W2 has A^j = 1- We set Ai„ = 1 in 7^ since 
A^j = 1 and = 1- We want to estimate A^,, the value A„ shall be the weighted 
sum of the A, 



~ ^ Fy[t,]\ + Mt2]K, = 1/3 ■ 1 + 2/3 ■ 2 = 5/3. 

With those estimated values for A^,, we now estimate f^jj^- Let %,{ = 1, . . . , m, be 
the binary resolutions of T. We get 

^^P^ = t^' ^ — ^ 0.25 X 10^^ 
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which favors the Yule model over the uniform model. Note that without the estimates 
for A„, we would have to calculate Py[7i] and P(7[7i] for the 3^ x 15 linear resolutions 
of T. 

In Section 14. 3t we will assume that the primate tree Tp evolved under the Yule 
model. 



2.3 Yule model vs. uniform model 

As we have seen in Corollary ()2.1.3p . the probability of generating a given tree T 
with n leaves under the uniform model is 

on— 1 
nlCn-l 

By Corollary (|2.2.5p . the probability of generating a given tree T under the Yule 
model is 

The fraction of the two probabilities, the 'Bayes factor' [0], is 

Py[T] ^ Cn-i 

Given a tree T, we want to know if it evolved under the Yule or the uniform model. 
The fraction being bigger than 1 suggests favoring the Yule model, the fraction 

being smaller than 1 suggests favoring the uniform model. So In ( 5^43- ) being bigger 



than suggests favoring the Yule model, the logarithm being smaller than suggests 
favoring the uniform model. In the following, we want to calculate the expected value 
Ey[ln ( )], given the tree T evolved under the Yule model. We will see that 



Y 



In 



is the 'Kullbach-Liebler' distance (defined below) between Py and 

goes 



1 / irulT] 



u[r] 

P[/, and show that it goes to infinity with increasing n. Further, E, 

to infinity with increasing n. Therefore, for n large enough, the value In ^ j is 
relevant to the question of testing whether a tree evolved under the Yule or uniform 
model. In Sect ion l374| we will actually test the Yule model against the uniform model. 



2.3.1 The Kullbach-Liebler distance 

Definition 2.3.1. Let X be a discrete random variable which takes values in the 
finite set Q = {wi,W2, ■ ■ ■ , w„} with associated probabilities {p{uji) , p{u2) , ■ ■ ■ ,p{un)}- 
We call this probability distribution p. The information content of an event u; G f2 is 

/(cu) = — \np{u) 
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The entropy Jp of the probabihty distribution p is defined as 

J, = E[/(X)] = -5^pHlnpH 

In jni, Chapter 7, the entropy Jy for the Yule distribution over RB{n) and the 
entropy for the uniform distribution over RB{n) are calculated. Recall that for 
two functions f{n) and gin), we write f{n) ~ g(n) precisely if lim„^oo = 1- 

For Jy, one has (from 0) 

n-l 



fc=2 

where g{k) = In + In | + ln(fc + 1) — ^ \nk\. Asymptotically, one has 

Jy - nln(n) + cin ^ In(ra) (2.2) 

where Ci = ln(2)ln(|f ) + ln(9) In(^) + 2Li2(|) - 2Li2(|) - 1 ^ 0.493 and 
Li2(x) = j^rft. 

For Su, one has (again from [9j) 

JJc/ = ln|i?5(n)| =ln(2n-3)!! (2.3) 

and asymptotically 

J[7 — n ln(n) + ~ — ln(?7,) (2.4) 
where ca = 1 - ln(2) ^ 0.307. 

Definition 2.3.2. Let p and g be probability distributions over a finite set Q. The 
Kullbach-Liebler distance between p and g is defined as 



dKL{p,q) = X^P(^) In^*^'^ 



g(^)' 



Remark 2.3.3. The Kullbach-Liebler distance is positive definite, i.e. dKiiPiQ) > 
with (Ik hip, g) = iff j9 = g. Notice that dxLip, q) = oo iS there exists a u E fl with 
p{u) > 0, ^(m) = 0. For p = Fy and q = P{/, both dxiip, <?) and dKL{q,p) are finite, 
since Py[T] > and Fu[T] > for all T G i?5(n). Note that the Kullbach-Liebler 
distance between p and q is not symmetric, i.e. we have dKL{p,q) 7^ dKL{q,p) in 
general. 
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Remark 2.3.4. Note that the KuUbach-Liebler distance between the probability 
distributions p and q over the set equals the following expected value 

dKLip, g) = J] In ^ = E,[ln ^] . 

Lemma 2.3.5. LetD, be a finite set. Letp be any probability distribution overQ., and 
let q be the uniform distribution over fi. Then 

Proof. By assumption, q{u!) — for all a; e Jl. Prom the definition of dKLip,Q), 

it follows that 

dKL{p,q) = ^p(a;)ln^^ 

= ^P{^) lnp(u;) -^p{uj) In q{u) 



= -Jp- J]p(a;)ln-i- 

V 11/ ,.,cO 



□ 



2.3.2 KuUbach-Liebler distance between Py and Pc/ 

In the following, we calculate the KuUbach-Liebler distance between the Yule 
distribution Py and the uniform distribution ¥u over RB{n). 

Theorem 2.3.6. Let Py be the Yule distribution and ¥u be the uniform distribution 
over RB{n). The KuUbach-Liebler- distance between those two distributions is 

(iKL(Py,Pt/)=ln(2n-3)!!-n^ ^^^^ 



k + l 

k=2 
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where g{k) is again defined as g{k) = In + In | + ln(A; + 1) — ^lnk\. 
Asymptotically, we have 

dKL{^Y,^u) - cyn ~ -1/2 InH 

with cy ~ 0.186. 

Proof. From Lemma we have c?A-L(Py, Pi/) = Ju - Jy- With Equations (EUl) 

and (Q, we get dj^L(Py,Pc/) = ln(2n - 3)!! - ^^^=2 0- For the asymptotic 
behavior, we get with Equation ()2.2j) and ()2.4j) 

— n \n{n) + — (Jy — n \n{n) + Ciu) ~ — ln(r;,) + 1/2 ln(r;,) 
Jc7 - Jy - (ci - C2)n ~ -l/21n(n) 
Jc/ — Jy — Cyn ~ — l/21n(n) 

where cy = Ci — C2 ~ 0.186. □ 
Corollary 2.3.7. For the expected value Ey[ln^], we get 

P 

Ey[ln— ] - Cyn l/21n(n) 

Pi/ 

So Ey [In ^] — > oo for n — > oo. 

Proof. With Theorem ()2.3.6j) . we get 

P P \T] 

Ey[ln-^]-cyn = Yl 

= d^L(Py,Pc;) -cyn 

~ -l/21n(n) 

That implies dKL(W'y,Fij) ~ cyn and since cy > 0, we have Ey[ln|^] oo for 
n — >• oo. □ 

2.3.3 Kullbach-Liebler distance between ¥u and Fy 

In the following, we calculate the Kullbach-Liebler distance between the uniform 
distribution Fu and the Yule distribution Py over RB{n). 

Lemma 2.3.8. The central binomial coefficient (^^) can be written as 

m J -LA 2j 
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Proof. 

'2m\ (2m)! 22™ ■ 2m ■ (2m - 1) ■ (2m - 2) ... 3 ■ 2 ■ 1 



m / m!m! 2m ■ 2m ■ 2(m — 1) ■ 2(m — l)...4-4-2-2 

m— 1 



i2m 



-j-j- 2m - 2j - 1 

J-l 2(m - ?) 

i=o ^ "'^ 



■)2m 



n 



2j-l 



2? 



Lemma 2.3.9. For t/ie set RB{n), we have 

n-l 



□ 



E Y.^''^^ = Y.^''%ll]\RB{^ + l)\\RB{n-^)\ 
where is defined as in Lemma M.3. <^) . 

Proof. We have A^, G {l,2,...,(n — 1)} since a binary tree T with n leaves has n — 1 
interior vertices. We rewrite the double sum as 

n-l 

r &RB(n) ^(zVr i=l 

To calculate \{(T,v) : T G RB{n),v G Vr, = we have to count all the pairs 
(T, v) with f G Vr having exactly i interior nodes as descendants. For a binary tree, 
this is equivalent to v having i + 1 leaves as descendants {of. Figure EHj) • So for an 
interior vertex v, we choose a subset X' of X consisting of i + 1 elements, which 
shall label the leaf descendants of v. We have [^^i) possibilities to choose those i + 1 
elements. There are \RB{i + 1)| possibilities to build up a binary tree with leaf set 
X' and root v. Let X" = {X \ X') U v, so \X"\ = n - i. For the set X", there are 
\RB{n — i)\ possible binary trees. Combining all those possibilities yields 

\{T,v -.T eRB{n),v G Vr,K = i}\ = (^^^ ^\RB{i + l)\\RB{n - i)\ 

which proves the Lemma. □ 

Theorem 2.3.10. For the distance dKLC^u^'^r) , it holds that 

dKL(^u,'^Y) = nSn - lnc„_i 



where Sn = Yl'i=2 
Lemma \2.1.1]) . 



\ni T-rn-i-l ^ 2j 

i+llli=l 1-^., 
2(3+») 



and Cn are the Catalan numbers as defined in 
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some tree structure 



7 /"""r\ \"\ 



'^71—1 '^'11 




1 1 V" 



Xi 



Figure 2.5: Counting the pairs {T,v) in Lemma ()2.3.9p . The variables (xi, . . . ,Xi^i) 
take any distinct values from X', the variables (xj_|_2, • • • ,Xn-i,Xn) take any distinct 
values from X". 



Proof. By definition of the KuUbach-Liebler distance and with Corollary ()2.1.3j) and 
^rn^ and setting N = \RB{n)\, we have, 



dKL{^U,^Y] 



y P^[r]ln^ 

r&RB{n) ^ ^ J 



in— 1 



n\Cn-l 



E 

re-RB(n) 

y 

r&RB(n) 



In 



n!c„_i 



Cn-i 



1 

iV 

1 



- lnc„_i 



-s — In c 



n—l 



(2.5) 
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where s = J^TeRBin) Si^eVr ^^-^f With Lemma (j2.3.9p and Lemma (j2.1.1|) . we get 

r eRB(n) 
n-1 



i=2 

n-1 



n 



1=2 



n— 1 / ^ 
i=2 
n-1 



n \ci(i + l)! Cn-i-i{n-i)\ 



i + l 



n-1 



(i + l)\{n-i)\ 
{i + l)\{n - i - 1)\ 



^i^n—i—l 



N 

Cn-l 



I - [n-l) ■ dCn-i-l 



i=2 



Nn 



n-1 



i=2 



Inz |^2^^^ [2{n -i-l) 
n — i — 1 



With Lemma ()2.3.8|) we get 



Nn 



n-1 



ft X ^ 

n-1 2j-l 2^ 



92(n-l) TT"^ 

lli=l 2j 1=2 

n-1 



Inz 



i + 



2j-l 
2j 



22(n-j-l) 



n—i—l 

n 



2j-l 
2j 



i=2 
n-1 



In i ^i-^ 2j 



n ^J TT 2j - 1 "tt-^ 2j - 1 
. , X 2? - 1 2? -1-1 

j=i J j=i J j=i 



2j 



i=2 

n-1 

i=2 

n-1 

i=2 
n-1 

i=2 

n-1 



n-1 



liii T-r 2j "-A-^ 2j - 1 

i + l -L-L 2? - 1 

i=j+l j=l 



2j 



Ini "xV' 2(j + z) "t^'2j-1 



- TT 





n—i—l 



2j 



I + 



i=2 



In^ — - (j+z)(2j-l) 
1 y (2(j+0-l)j 

In i "-pr ^ 2j — 1 

^ i=l 2(i+i) 



, . n— J— i 1 i 

m« i — — 



^ + 



2(i+i) 



Combining this result with Equation ()2.5|) estabhshes the theorem. 



□ 
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Lemma 2.3.11. The asymptotic behavior of the n-th Catalan number Cn is 

c„ ~ n In 4 

Proof. With the Stirhng formula, Inn! ~ ralnn — n (see I^DjWe get 

f2n\ 

In c„ = — ln(n + 1) + In I ) 

\nj 

= -ln(n + 1) + ln(2n)! - 21nra! 

~ — ln(?2 + 1 ) + 2n In 2n — 2?2 — 2n In n + 2n 

= -ln(n + 1) + 2raln2 

~ nlnA 

□ 

Theorem 2.3.12. The Kullbach-Liebler distance between P(/ and Py is 
asymptotically 

where cu is a positive constant. 

Proof. From Theorem ()2.H.l()j) . we have 

dKli^U^'^Y) = nSn - lnCn-1 



with Sn = YT ^ 



A=2 



Ini T-rn-i— 1 ^ 2j 
i+lllj=l i_ 



and On being the n-th Catalan number. By 
Lemma ()2.3.11|) . it holds c„_i ~ nln4. In Section [2.3.41 we show that 

ln4 < 1.44 <Sn<S' + N 
for all n > 200 with S' and being some fixed constants. This yields to 

dxii^u, Py) = nSn - In c„_i ~ nSn - n In 4 ~ cun 
with Cu being a positive constant. □ 
Corollary 2.3.13. We obtain 

E;7 [In — ^] ^ oo for n ^ cxD 

Py 



since Eu[ln^] = di^L(P[/,Py) by Remark jOlP 
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2.3.4 Calculating Sn 

In Theorem ()2.3.10|) . we obtain the following formula for the KuUbach-Liebler 

distance between P[/ and Py: 

dKL(Pu,'^Y) = nSn - \ncn-i 

1— — 

with Sn = J2'^=2 [ttt ■ ^n,i\ and an,i = n?=i ¥ • In the following, we will 

^ 2(7+17 

calculate an upper and a lower bound for Sn- Note that {an,i,n G N} is monotone 
decreasing for fixed i and a„^j > 0. So lim„^oo an,i exists. 



a,- := lim 



1-^ 



nTT±=n 1 

j=l 2{i+0 j=l V 

n-l 



1 \ 



2j7 



>o 



i=2 



Ini 

'i + l 



With the property 



ln(l — x) = —X — — < —X 
^-^ % 

i=2 

for < X < 1 (see [12]) and the property 

Y)->f-dx = ln(z) 

.7 = 1 ^ 



we get the following: 



In a,- 



< 



9 ^ 1 



So we have 



2 ^ J 

< --ln(i) 



In the following, we show that 5"^ converges. 



n-l 



5; = E 



1=2 >- 



Ini 

ITT 
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Since X]i^2 ^ converges, it follows that {S'^, n G N} is bounded. The sequence 
{S'^,n G N} is monotone increasing since 4^ • > for all i G N,i > 2. So 
lim„_»oo S'^ exists and we define 

lim S'^ := S'. 

n— >oo 

Now we calculate an upper and a lower bound for Sn. Since aj_„ — > Oj, there exists 
an G N s.t. ai,„ < (1 + 1/-S")aj for all n> N. 



n—l 



1=2 



Ini 



'•n,t 



n-1 



i=Ar+l 



Ini 



:i + l/S')a, 



< (iv-i)+(i+i/5')5; 



Since S'^ is monotone increasing, we get 

Sn<{N + + 1/S')S'^ <{N-1) + {1 + 1/S')S' 
which yields to 



Sn<S' + N. 



Since aj,„ > a^, we have 



n—l r 



1=2 



Ini 



''11,1 



> 



n—l 

E 

i=2 



Ini 
i+1 



S„ 



So we get Sn > S'^ for all n. With Maple, I calculated ^200 ~ 1.44 > In 4. Overall, we 
have 

ln4 < 1.44 <Sn<S' + N 



for all n > 200. 



Chapter 3 

Trees and Martingales 



In this chapter, we have a closer look at the process of the tree generation. We will 
see that the tree generation is a certain stochastic process, a martingale. Under the 
uniform model, the martingale fulfills the conditions for the Azuma inequality. 

We make use of this property at the end of the chapter. We test the Yule 
model against the uniform model with the log-likelihood-ratio test. With the Azuma 
inequality, we find an analytical bound for the power of the test. Since the algorithms 
in Chapter 0] work in particular for trees under the Yule model, it will be useful to 
have a test for deciding whether a tree evolved under Yule. 

First, we provide some basic definitions and properties on conditional probability 
and martingales. 

3.1 Conditional probability and martingales 

Definition 3.1.1. Let X (resp. Y) be a discrete random variable which takes values 
{xi,i G N} (resp. G N}). The conditional expectation 

Z = E[X|r]=^x,P[X = x,|F] 
i 

is a random variable. Z takes values 

j 

on the set {¥ = yi} with probability F[Z = Zi] = ¥[Y = yi]. 

The two equations in the next Lemma are stated in with a brief verification. 
We will give a full proof. 
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Lemma 3.1.2. Let X (resp. Y , U) be a discrete random variable which takes values 
{xi,i G N} (resp. {yi,i G N}, {ui,i G N}j. Further, assume ]E[|X|] < oo. Then, we 
get the following two equalities: 

E[X] = K[E[X\Y]] (3.1) 
E[X\U] = E[E[X\Y,U]\U] (3.2) 

Proof. Let Z = E[X\Y]. We obtain Equation dSH) from 



E[E[X\Y]] = ^^,P[Z = z,] 

i 






x,\Y = yi\F[Y 




Xj,Y = yi] 


j i 


^j^Y = yi] 


j 




= E[X] 





The summation order in (*) can be changed since E[|X|] < oo. 

It is left to verify ()3.2|1 . Let = E[X|y, f/]. The random variable W takes a value 

^ii,i2 = ^k^X = Xk\Y = yj^,U = n^-J 

k 

with probability ¥[Y = yj^^U = Uj.,] where ji G N and ja e N. Let Z = E[W\U]. The 
random variable Z takes a value 

Zi = E[W\U = Ui] 
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with probability F[U = where z G N. We transform Zi to 
= E[W\U = u^] 

= Yl ^n,h^W = Wj,j,\U = Ui] 



n,j2 



Ui 



31,32 k 

^ J]xfeP[X = Xk\Y = yj„U = Ui]F[Y = y,Jf/ 

ji k 

Yl ^'^^[^ = Xk,Y = yj„U = Ui]/F[U = Ui] (**) 

ji k 

Y Y ^'^^[^ = Xk,Y = yj„U = Ui]/F[U = Ui] 

k ji 

Y^k^X = Xk,U = Ui]/F[U = u,] 



Ui 



k 

= E[X\U = Ui] 

The summation order in (**) can be changed since ]E[|X|] < oo. So we obtain 

E[E[X\Y,U]\U = Ui] = E[X\U = Ui] 
for all i G N, i.e. E[E[X\Y,U]\U] = E[X\U]. □ 
Definition 3.1.3. A stochastic process {Zn,n e N} is called a martingale if 

E[\Zn\] < OO WneN 

and 

Z2, . . . , Zn] — Zn- (3.3) 

Remark 3.1.4. Taking expectations of ()3.3|1 with Equation ()3.1|1 gives 

E[Z„+i] =E[Z„]. 

The results of Lemma ()3.1.5j) and Theorem (j3.1.(ij) are already stated in [THj. 
Again, the following proofs are more detailed. 

Lemma 3.1.5. Let {Znjn E N} be a discrete stochastic process with E[|Z„|] < 00. 
Let Y be a vector of discrete random variables. If 

l\Zl, . . . , Zn, Y] — Zn 

then {Zn} is a martingale. 
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Proof. It holds K[Zn\Zi, . . . , Zn] = Z^ since E[Z„|Zi = zi, . . . , Zn = Zn] = Zn- With 
that property and with Equation ()3.2|) . we get 

IE[Zn+l|-Z'l, . . . , ^n] = E[E[Z„+i|Zi, . . . , Z„, Y]|Zi, . . . , Z„] 
= IE[Z„|Zi, . . . , Zr\ 

□ 

Theorem 3.1.6. Let X, Yi, Y2, ■ ■ ■ be discrete random variables such that E,[\X\] < 00 
and let 

Z„ = E[X|Fi,...y„] 
for all n G N. Then {Zn, n ^ N} is a martingale. 

Proof With Equation we get E[|Z„|] = E[\E[X\Yi, . . . ,Yn]\] < 

E[E[|X||yi, . . . ,Yn]] = E[|X|] < 00. To check the second condition for a martingale, 
it is, by Lemma ()3.1.5p . sufficient to show that K[Zn+i\Zi, . . . Z„, Yi, . . . , = Zn- 
We have 

'^[Zn+l\Zl, . . . Zn,Yi, . . . ,Yn\ = E,[Zn+l\Yi, . . . ,Yn] 

= E[E[x|yi,...,r„+i]|ri,...,y„] 

= E[X|Fi,...,K] (fromjH) 

= Zn 

which proves the theorem. □ 
3.1.1 The Azuma inequality 

Let {Zi,i G N} be a martingale. If the random varialbes Z^ do not change too fast 
over time, Azuma's inequality gives us some bounds on their probabilities. 

The following theorem, the Azuma inequality, is stated in ^3] with a detailed 
proof. 

Theorem 3.1.7 (Azuma's Inequality). Let {Zi,i e N} 6e a martingale with 
K[Zi\ = yU. Let Zq = fi and suppose that for nonnegative constants aj, j3j, j > 1, 

-aj < Zj - Zj_^ < (3y 

Then for any i > 0, a > 0.- 

(z) ¥\Z, - > a] < exp{--— -— } 

Ej=i(«j + /?i)^ 

(u) -^^<-a]< exp{--— -— } 

E,=i("i + /5i)' 
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The following corollary will be very useful for the next section. 

Corollary 3.1.8. Let {Zi,i ^ N} be a martingale with E[Zj] = fi. Let Zq = fi and 

suppose that for a nonnegative constant , j > 1, 

I — — 1 1 — ^ 

Then for any i G N; 

P[Z„<0]<exp{-^} 

Proof. Let = /?i = ^ for alH G N and a = jj,. Then inequality (u) in Theorem 
f)3.1.7p establishes the corollary. □ 

3.2 A martingale process on trees under the 
uniform model 

In this section, we assume that a tree T G RB[n) evolved under the uniform model. 
Consider the following setting: 

• Let hu : RB{n) M with hu{T) = In M = in ^^f^. 

• For j G {1, . . .,n}, let Yj : RB{n) RB{j) with Yj{T) = T|{i,..j}. 

• For j > n, let Yj : RB{n) RB{n) with Yj{T) = T. 

• Let Zi = E[hu\Yi,...Yi\. 

We have E[|/ii7(T)|] < oo since T is chosen from the finite set RB{n) and 
maxr^RBin) \hu{'^)\ < oo. With Theorem ()3.1.fi|l . we obtain that {Zi,i G N} is a 
martingale. Note that 

Zi = E[hu\Yi,...Y;\=E[hu\Y;\. 

For alH > n, we have 

Z,=E[hu{T)\Y, = T]=hu{T). 
The expected value fiu of Zn is, with Remark ()2.3.4|1 . 

fiu = = E[hu{T)] = rfxL(Pc/,Py). 

Theorem ()2.3.12|1 shows 
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which means 

IJ,u ~ cun. 

In the following, we want to apply Azuma's inequality to the tree martingale 
{Zi,i G N}. First, set Zq := E[Z„] = (ixL(F;7, Py). To apply Azuma's inequality, we 
have to verify \Zi — < 'tfu for all « G N. 

• For i = I, note that by definiton, we have 

Zi = E[hu{r)\Y^] = E[hu{r)] = dKL{^u,^Y) = Zo 
so \Zi — Zq\ =0. 

• For i > n, note that Zi = E,[hu{T)\T] = hu{T). So \Zi — Zi^i\ = for all i > n. 

• Section (j3.2.1|) will establish \Zi — Zi_i\ < Inn for 2 < i < n. 
With Corollary ()3.1.8|) . we then have 

P[Z„,<0] < exp{- } 

2n[mnj^ 

2 

~ exp{-— p^— r} ^0 for n ^ oo 
2(lnn)"' 

Note that Zn = hjjiT) = In . So for a tree T generated under the uniform model, 
the probability that P[/[T] is smaller than Py[T] tends to quickly with n as the 
number of leaves tends to oo. Therefore the Bayes factor f^j^ is a very good indicator 
as to whether a 'big' tree evolved under the uniform model or not. 

3.2.1 Calculating a bound in the Azuma inequality 

Let {Zi,i G N} be the tree martingale introduced above. We can transform Zi into 

Zi = E[hu\Yi\ 

= huirnriYii 

T£RB{n) 

= J2 InH^^^PlTlF,] 

reRB(n) 

= 5Z I 5^ lnA,-lnc„_i j P[r|r,] 

TeRB{n) \veVr } 



J2 $^inA.|p[r|y,] 

T&RB(n) UeVV 



lnCn-1 
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The random variable Zj therefore takes values 



reRB{n) UeVr 



In Cn-l 



for all t e RB{i). 

Assuming that T was generated under the uniform model, i.e. 



¥[T\Yi = t] 



p[r] \RB{i) 



¥[t] \RB(n) 



we get, for t G RB{i), 



Zi,t — 



reRB{n) \v^Vr 

T\{i,...,i}=i 



\RB{i) 
\RB(n) 



In Cn-l 



\RB{i) 
\RB{n) 



TeRB{n) y^Vr 
...,<}=* 



- lnc„_i. 



Let T be a binary phylogenetic tree. For the subtree T|{i^...^i}, we will write T{i). 
The set of all binary phylogenetic trees with leave set {1, . . . , i — 1, i + 1, . . . n} shall 
be RB{n, i). In the following, we will calculate an upper bound for \Zi — Zi_i\. Note 
that 

— Zj_i|= max \zit — za-i)t(i-i)\- 
teRB{i) ^ ' 
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The difference \zi^t - Z(i^i),t{i-i)\ is 



r G-RB{n) ^gy^ 



\RB{n)\ 



T(i-l)=t(j-l) 



\RB{n 


)\ 


\RB{%- 


1)1 


\RB{n 


)l 


\RB{i - 


1)1 



(2^-3) E Ei^^'^- E E Ei^^^ 



r(i)=t 



t'eRB{i) T eRB(n) t,gvV 

i'(i-i)=t(i-i) r(j)=t' 



t'£RB{i) TeRB{n) „gy^ 
t'(i-l)=t(i-l) r(i)=i 



t'eRB(i) T eRB(n) t,gVr 

t'(i-i)=t(i-i) r(i)=t' 



\RB{n) 



\RBi^-l)\ 
\RB{n)\ 



( 



E 



t'eRB{i 



\ 



E E^^^-- E E^^^- 



t'{i-l)=t(i-l) \ T{i)=t 



E E 

t'eRB{i) r'eRB{n,i) 

t'{i-i)=tli-i) r'{i-i)=t{i-i) 



T eRB{n) ^,gvV 
Tii)=t' 



J 



\ 



E Ei-^- E Ei-^^ 



T eRB{n) ^gvV 

r\i=r' 



r e_RB(n) ^gvv 

r\j=r' 
r(i)=t' 



\RB{n) 



t'eRB{i) T'eRB{n,i) 

t'(i-i)=t(j_i) r'(i-i)=t(i-i) 



E E'-"^'- E E>"^. 



r eRB{n) i,6VV 

T\i=r' 

T(i)=t 



r eRB{n) ^6VV 

r\i=r' 
r(i)=t' 



Define 



s :- 



E E'"^»- E E'"^. 



T(iRB(n) „6VV 

r\j=r' 
r{i)=t 



r\j=r' 
r{i)=t' 



Consider the tree T in Fig. 13.11 Moving leaf z to a new position will change A^, of 
a vertex f , if v is on the path P from Vi to f •. The change of A^, when v <r Vi, is 
yiew _ \^ — the other vertices on that path, we have A"*^"" = A^ + 1. So we get, 
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with the property Inx — Iny = Inx/y, 

I 



E 

re-RB(n) 

T\i=r' 

r(i)=t \ ^<T 



veVr\vi 



A, 



V<TVI 



In 



A,: 



A^ + 1 



+ In A^. - In A' 



TeB.B(n) 

r\i=r' 

T{i)=t 



V<TV> 



veP 



= E 



reRB{n) 
T\i=T' 

r{i)=t 



veVr\vi 
veP 
v<rvi 



^ '■'^1^ E 



v&Vr\vi 

veP 

V<TV'; 



In. 



A„ + l 



+ s 



with 



, A' , 



, J In ^ if A^. < A' 

Note that for any v,w & P with v,w <r Vi or <r v^, we have A^ ^ A^. Tl 
yields to 



n-l 



2 E E"> 

TeRB{n) k=l 
T\i=T' 
T{i)=t 



k + 1 

k 
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Overall, we get, with using the property ln(l + x) < x ior x > 0, 



I , \RB{i-l)\ ^ ^ ^ + 1 

|^M-^(i-i),t(i-i)l < \RB(n)\ 1^ 1^ 1^ 2^^^-^ 

' ^ t'£RB{i) T'&RB{n,i) T€RB{n) k=l 

t'(i-l)=t(i-l) r'(i-l)=t(i-l) T\i=T' 

\RB(i)\ ""'^ 



E E E'Mi+i 



\RB(n), 

' ^ T'eBB{n,i) T(^RB{n) k=l 

T'{i-i)=t(i-i) r\i=T' 
r{i)=t 

\RB{i)\ 



' ^ ' T€RB{n) k=l 



E E'" in 



n-l . 

- E>"fi4 



fe=i 

n-l 



< 



fc=l 



< / —dx 



Therefore, 



1 X 
Inn. 



\Zi - Zi^i\ = max \zi^t - < Inn. 

t&RB{i) 



3.3 A martingale process on trees under the Yule 
model 

In this section, we assume that a tree T evolved under the Yule model. Consider the 
following setting: 

. Let /iy(r) = -/i^(r)=lnM. 

• For j e {1, . . . , n}, let Yj : RB{n) RB{j) with Yj{r) = T\{i_„jy. 

• For j > n, let Yj : RB{n) RB{n) with Yj{T) = T. 

• Let Zi = E[/iy|Fi,...Fi]. 
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Since hy = —hu, the process {Zi, i G N} is a martingale with the same argumentation 
as in Section Further, from Section we get 



J2 $^lnA.|P[r|F,] 

T&RB{n) \t,eVV 



+ In c„_i 



and 

for all t G RBii). 



J2 5^inA.|p[r|y, = t] 

reRB{n) \y<zVT 



+ In Cn-l 



3.4 Hypothesis testing: Did T evolve under the 
Yule model? 

In this section, the hypothesis that a given tree T evolved under the Yule model is 
tested against the uniform model. 

In ^ni, a test between the Yule and the uniform model is developed by counting 
cherries. It is shown that the number of cherries in a tree is normally distributed 
with different expected values for the two models. The power of the test stated in 
[TU] is above 0.90 for trees with more than 80 leaves. The power is only stated as an 
asymptotic result though. 

We will give an analytic result for the power of the log-likelihood-ratio test for 
the Yule model against the uniform model. 

First, we recall the basics about hypothesis testing. In a hypothesis test, we test 
for a given dataset x if a hypothesis Ho is rejected in favor of a hypothesis Hi or if 
Hq is accepted. The hypothesis test is characterized by a decision rule, it decides if 
Hq is accepted. 

The Type I error of a hypothesis test is 

a = f[Ho rejected \Hq true]. 
The Type II error of a hypothesis test is 

/? = F[Ho retained \Hi true]. 

The power of the test is 1 — /3. 

The next Lemma, the Neyman- Pearson Lemma (see JSl); states that for a given 
Type I error, the likelihood-ratio test is the test with the smallest Type II error. 
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Lemma 3.4.1 (Neyman-Pearson Lemma). When performing a hypothesis test 
between two point hypotheses Hq and Hi, then the likelihood-ratio test which rejects 
Ho in favor of Hi when 

¥[x\Ho true] ^ ^ 



¥[x\Hi true] 

with k being some positive constant, is the most powerful test of size a, where a = 
^[ ppHi Vue] — ^1-^0 true] = f[HQ rejected\HQ true] as defined above. 

Note that the log-hkehhood-ratio test, i.e. rejecting Hq if 

In ''!"|^° In t 

F[x\Hi true] ~ 

is equivalent to the hkehhood-ratio test. We will test the Yule model against the 
uniform model with the log-likelihood-ratio test to get the smallest Type II error. 

Let Hq and Hi be the following hypotheses. 

Hq-. T evolved under the Yule model 
Hi'. T evolved under the uniform model 

The decision rule for this test shall be: 

• Zn = \n ¥p\T] > ^ accept Hq. 

• Zn = ln < ^ reject Hq. 

The Type I and Type II error can be obtained with simulations, i.e. construct a 
lot of trees with n leaves under the Yule model and estimate a and f3. 

With the results from the previous sections, we can provide an analytical bound 
for the Type II error. 

A bound for the Type II error of this test is, with Corollary ()3.1.8p and Theorem 

mm, 

(3 = ¥[Ho retained \Hi true] = Fu[\n > 0] 

{nSn - lnc„_i)% 
= exp{ — 1 3.4 
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with Sn and c„ as defined in Theorem (j2.3.1(jp . Asymptotically, we get, with Theorem 

{cunf 

p ~ expj-— -} 

2n(lnn)^ 

, ((1.44-ln4)n)^ 
< exp{- ^^ ; ^ } 

?i 

^ exp{-0.00144- -r} 

So the power of the test, 1 — /3, tends to 1 as n tends to oo. 

With the current bound, the power of the test, calculated by Equation (j3.4|) . is 
bigger than 0.85 only for trees with more than 600 leaves. It is probably possible 
to improve the bound for the Azuma inequality though. If the current bound. Inn, 
could be improved to 1/4 Inn, the power of the test would be bigger than 0.90 for 
trees with more than 50 leaves. A bound of 1/2 Inn would result in a power bigger 
than 0.90 for trees with more than 170 leaves. 



Chapter 4 

The Rank Function 



Consider the primate tree in Appendix O Was speciation event with label 76 more 
likely to be an early event in the tree or a late event? What is the probability that 76 
was the 6th speciation event? Was it more likely that speciation event 76 happened 
before speciation event 162 or 162 before 76? This chapter will provide an answer 
to those questions, under the assumption that each rank function is equally likely, 
which is, in particular, the case under the Yule model. 

The algorithms RankProb, Compare and an algorithm for obtaining the 
expected rank and variance for a vertex were implemented in Python. The code is 
attached in Appendix^ This is joint work with Daniel Ford from Stanford University. 

In Section we will show how to estimate edge lengths in a tree by calculating 
the probability distribution of the rank of a vertex. This question was posed by 
Arne Mooers and Rutger Vos, who constructed the primate supertree and wanted to 
estimate the edge lengths for it (see [IHl)- 

4.1 Probability distribution of the rank of a vertex 

Let T be a binary phylogenetic tree. Specifying an order for the speciation events 
(i.e. the interior nodes) in T is equivalent to introducing a rank function on T. In 
this chapter, we are interested in the distribution of the possible ranks for a certain 
vertex, i.e. we want to know the probability of r{v) = i for a given v & V. In other 
words, we want to calculate P[r(t>) = i\T], with r G r(T), r(T) is the set of possible 
rank functions on the tree T. If every rank function on a given tree is equally likely, 
we have 



A formula for the denominator is given in Lemma ()1.3.8|) . The enumerator will be 
calculated in polynomial time by algorithm RankCount. 




(4.1) 
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Examples of stochastic models on phylogenetic trees where each rank function is 
equally likely: 

• For the Yule model, we have seen in Theorem (j2.2.4|) . that P[r|T] is the uniform 
distribution. 

• As we have seen in Remark (j2.2.8j) . the coalescent model has the same 
probability distribution on rooted binary ranked trees as the Yule model. So 
P[r|T] is the uniform distribution. 

• In the uniform model no rank function is induced when a tree is generated. We 
can assume though that for a given tree T, each rank function is equally likely. 
Then, Equation ()4.H) holds as well. 

Definition 4.1.1. Let T be a rooted phylogenetic tree. Define 

ar,«(0 := \{r ■ r{v) = i, r e r{T)}\ 

for V & V,i & 1, . . . ,\V\. In other words, ar^vi'i) denotes the number of rank functions 
r for T in which v comes in the i-th position. 

The following results will be needed in the next sections. 

Lemma 4.1.2. Let 

X — ^2 ■ ■ ■ I 

2 r 2 2 2 1 

X — X2 . . . -^nj J 

_ r d d d \ 

X — ^2 . . . -^n^i 

he d disjoint sets with the linear order x\ < X2 < . . . < a;^. for each i G {1, . . . , c?}. 
The number ^ of possible linear orders on the set x^ U x"^ U . . .U x'^, with the linear 
order of each original set being preserved, is 

^ \i=\ / 

Proof. The number ^ of linear orders of the Yli=i elements of U U . . . U x*^, 
allowing any order on x*, is ^ = ( Yli=i ) The number .ifj of linear orders of the nj 
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elements of is {ni)\. Since for =Sf, we only allow the linear order x\ < X2 < ■ ■ ■ < x^^ 
on x\ it holds 



d 

Hi 

i=l 



d 

rii 

i=l i=l 



m n 



Corollary 4.1.3. For d = 2 in Lemma I{4-1-^ , we have 

possible linear orders on x'^ U x"^, preserving the linear order on x^ and x^ . 
Proof. From Lemma ()4.1.2j) follows 

2 



5Z 

.i=i / (^^1 + ^2)! (ni + n2 



2 (rii)!(n2)! V ^1 

i=l 



□ 



□ 



Remark 4.1.4. The values (^) for all n,k < N {n,k,N e N) can be calculated in 
0{N^), cf. Pascal's Triangle. In Appendix El a dynamic programming version for 
calculating (^) is implemented. Thus, after 0{N'^) calculations, any value (^) with 
n,k < N can be obtained in constant time in an algorithm. 



4.1.1 Polynomial- time algorithms 

In the following, we give a polynomial algorithm to determine ar,v{i) for v & V and 
i = 1, . . . , |V^| in a binary phylogenetic tree T. 

Algorithm: RankCount(T, v) 

Input: A rooted binary phylogenetic tree T and an interior vertex v. 
Output: The values of ar,v{i) for z = 1, . . . , |V^|. 
1: Denote the vertices of the path from v to root p with 

{V = Xi,X2, ...,Xn=p). 

2: Denote the subtree of T, consisting of root Xm and all its descendants, by 7^ for 
m = 1, . . . ,n. {cf. Figure HH}. 
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3: for m = 1, . . . , n do 
4: for i = 1, . . . , \Vr\ do 

5: aTr^A^ ■= 
6: end for 
7: end for 

8: arU^) := 

11 

9: for m = 2, . . . , n do 

10: T^_^ := %n\Lr,ALT^.^ (^Z Figure 



\Vr' I'- 
ll: Rr' ■- 



12 
13 

14 



15 
16 
17 



n 

for i = m, . . . , \ Vt^\ do 
M:=min{|Vr4_J,^-2} 

E,=o «r„_„.(^ - J - ^)Rt^_, ( 1^^^ ) ( , ) 

m — 1 

end for 
end for 

RETURN aTv-=ar„v 



Theorem 4.1.5. RankCount returns the quantities 

Oir,v{i) = \{r : r{v) = i,r e r{T)}\ 
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for each given v & V and all i & 1, . . . ,\V\. 

Proof. We have to show that all the 07-^ ^,(2) produced by RankCount equal 
the ar^^vii) defined in ()4.1.1|) . In the following, we denote the values ar^^^ii) 
produced by the algorithm with ot^^^i^) and ar^^y{i) shall denote the number of 
rank functions with r{v) = i as defined in ()4.1.1|) . We will show aT^^vii) = c^rl^ul^) 
for m = 1, . . . , ra, « = !,..., \Vr\- This is done by induction over m. 
For m = 1, a7-i,'u(l) = Q^ri^t,(l) since p.3.8|) holds. Vertex v is the root of 7^, so 
(^ri,v{i) = for alH > 1. 

Let m = k and (y%„.v{i) = (^Tmvi^) holds for all m < k. ar^^y{i) = clearly holds for 
all i > iVVfcl since rr^ : v ^ {1, . . . , \V%\}- So it is left to verify that the term (*) 
returns the right values for ar^^y{i). Assume that the vertex v is in the {i — j — l)-th 
position in 7fc_i (with i —j — 1 > 0) for some rank function rr^_^ and v shall be in the 
i-th position in 7^. We want to combine the linear order in the tree Tk_i induced by 
r7-j._i with a linear order in T^'.^ induced by vq-^ ^ to get a linear order on 7^. The first 
j vertices of T^_i must be inserted between vertices of Tk-i with lower rank than v 
so that V ends up to be in the i-th position of the tree 7^. We will count the number 
of possibilities to do so. The tree 7^'_]^ has 

\Vr' |! 



k-1 

possible rank functions. Combining a rank function rrj._i with a rank function rj-^ ^ 
for getting a rank function rr,. with rr^. (f ) = i means inserting the first j vertices of 
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Ti^_i anywhere between the first {i — j — 2) vertices of Tf^-i- There are 

possibihties according to Corollary 14.1 .31 For combining the |VVfe_J — ^ J ^ 1) 
vertices of rank larger than v in Tk-i with the remaining \ Vr^ J ~ 3 vertices in 
we have 

-i^-J-l) + \Vru\-J\ f\V%J + \VnJ - - 1 



possibilities. This follows again from Corollary 14.1 .31 The number of rank functions 
r7-j._i with rr^_j^{v) = i — j — 1 is aTk.i,v{i — j — 1) by the induction assumption. 
Multiplying all those possibilities gives 

<^T,,,v{i) is then the sum over all possible j which is equal to the term (*) for a^^^^{i)- 
This establishes the theorem. □ 

Theorem 4.1.6. The runtime o/ RankCount is 0{\V\'^). 

Proof. Note that the number of rank functions Rr = n ^ on a tree T with V 

interior vertices can be calculated in 0(|V^|), i.e. in linear time. 

Further, note that the combinatorial factors (^) for all n,k < \V\ can be calculated 
in advance in quadratic time, see Remark ()4.1.4|) . In the algorithm, those factors can 
then be obtained in constant time. 

Contributions to the runtime from each line in RankCount (the runtime is 
always w.r.t. |V^|): 
Line 1-2: linear time 
Line 3-7: quadratic time 
Line 8: linear time 
Line 9-16: quadratic time since: 

Line 11: Rq-' _^ can be calculated in 0(| This has to be done for m = 1, . . . , n, 

so overall the runtime for calculating all Rr' _^ is no more than 0(|V^p) since n < \V\. 
Line 14: We add up all calculations needed for obtaining ar^^^^i), m = 1, . . . ,n, 

n n n 

m=2 m=2 m=2 



CHAPTER 4. THE RANK FUNCTION 



48 




Figure 4.3: Illustration for runtime of RankCount 



The last inequality holds since the vertices of the 7^, m = 1, . . . ,n — 1, are distinct. 
Therefore, line 14 contributes a quadratic runtime. 
Line 17: constant time 

So overall, the runtime is no more than 0(|V^p). Figure shows a tree for which 
the runtime of RankCount is actually quadratic. Counting all the calculations for 
term (*) in the algorithm for the tree in 14.31 vields to 

n \VTm\ n |Vr„J 

m=2 i=m m=2 i=m 



5^2(|VrJ-(m-l)) 

m=2 

n 

^2((2m- 1) - (m- 1)) 



m=2 

n 



^2m 



m=2 



= n(n + l)-2 
Since n = {\V\ + l)/2, we have a quadratic runtime. 



□ 



Corollary 4.1.7. The probability F[r{v) = i\T] can be calculated in 0(|l^p). We 
have 

P[r(.) = ^\T] = ^lA^ = ^rA^n.,v^^_ ^4_2) 

Proof. The first equality in ()4.2|) follows from basic probability theory. The second 
equality holds since ^ ^ = ^jar,v(0 by p.3.8p . The complexity of the runtime 
follows from ()4.1.6|) . □ 
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Remark 4.1.8. We will write P[r(f ) = i] instead of P[r(f ) = i\T] in the following. 
With P[r(t') = i] from Corollary ()4.1.7|) . the expected value ^r{v) and the variance 
cr^j.^-) for r(t>) can be calculated by 

\v\ 

^r(v) = ^i^r{v) =i] 
1=1 
\v\ 

i=l 

Example 4.1.9. We will illustrate the algorithm RankCount for the tree in 
Figure lOl We get the following values: 

ari,«(l) = It = 1 

ar,.(2)=ar_„.(l)irnO=2 

«r„.(3)=«r_...(l)irnQ =1 
ar^A^) = 

«r3,.(3) = ar^_Um{'^l-') Q + ar^^UmC^l'") (D = 40 + = 40 
«r3,.(4) = ar^_,Amn') Q + ar^,,^^^^) (?) = 8 + 48 = 56 
«r3,.(5) = ar_„.(3)2n~') (?) + «r„.„.(2)2(^+^^) Q = 18 + 36 = 54 
«r3,.(6) =«^_,.(3)2(^+n (5) +ar_„.(2)2(^+n (3) =24 + 16 = 40 
«r3,.(7)=ar_„.(3)2(^+^-^)(^)=20 
ar3,«(8) = 
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With ara.v = 0!r,v, we get 







= 1] 


F[r 




= 2] 


F[r 




= 3] 


F[r 


» 


= 4] 


F[r 




= 5] 


F[r 




= 6] 


F[r 




= 7] 






= 8] 








40 



40 + 56 + 54 + 40 + 20 
28 



40 
210 



20 
105 



105 
27 

105 
20 

105 
10 

105 




Therefore, the expected value fir{v) is 

8 

I^Hv) = J^iP[r(t;) 
and the variance (T^/.,n is 



i=l 



497 
105 



4.73 



2 



r(v) 
8 

E 

i=l 



\r[v] 



2 



2513 497^ 



344 



Remark 4.1.10. Note that F\r(v) 



105 1052 

~ T,iO'T,vij)' 



1.53 



225 

Common factors in all 



(^T,v{i),i = !,•••, \ Vz,\ will therefore cancel out. 



The next algorithm, RankProb, is a modification of RankCount such that 
common factors of ar,v{i),i = will not be included. Therefore, the 

numbers we have to deal with in the algorithm stay smaller and the number of 
calculations is reduced. 



Algorithm: RankProb(T, f ) 

Input: A rooted binary phylogenetic tree T and an interior vertex v. 
Output: The probabilities P[r(f) = i] for i = l,...,|V^|. 

1: Denote the vertices of the path from v to root p with 

{V = Xi,X2, ...,Xn=p). 

2: Denote the subtree of T, consisting of root Xm and all its descendants, by 7^ for 

m = 1, . . . ,n. {cf. Figure HH}. 
3: for m = 1, . . . , n do 
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for i = 1, . . . , \Vr\ do 

end for 
end for 

ari,v{l) ■= 1 

for m = 2, . . . , n do 

^m-i := %n\Lr^\Lr^^^ {cf. Figure 

for i = m, . . . ,\Vr I do 

* * I -^m I 

M:=mm{\Vr^J,t-2} 



end for 
end for 

for i = l,...,|Vr| do 

mr(v) =i]— """'"^'^ 



end for 

RETURN P[r(f) =i],i = 1,..., IV^I- 



Theorem 4.1.11. RankProb returns the quantities 

¥[r{v) = i] 

for each given v E V and all i G 1, . . . ,\V\. The runtime is 0{\V\'^). 

Proof. Note that the structure of RankProb is the same as the structure of 
RankCount. The only difference is that common factors to ar^^y{i) for all i are not 
included. Those common factors do not change the probabilities since they cancel 
out once calculating the probabilities. Therefore, since RankCount works correct, 
also RankProb works correct. 

It is left to verify the runtime. The only time consuming step in RankProb is 
line 13. This line is of the same complexity as line 14 in RankCount. Line 14 in 
RankCount contributed a quadratic time. Therefore, the runtime of RankProb 
is quadratic as well. □ 



4.1.2 Non-binary trees and ranks 

Let T be a non-binary phylogenetic tree. Assume that any possible rank function on 
T is equally likely. With that assumption, we have 
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To calculate these probabilities, the algorithm RankProb can be generalized to 
non-binary trees. We call the generalized algorithm RankProbGen. 

Algorithm RankProbGen (T, v) 

Input: A rooted phylogenetic tree T and an interior vertex v. 
Output: The probabilities P[r(f ) = i] for i = 1, . . . , \V\. 
1: Denote the vertices of the path from v to root p with 

(f = Xi,X2, ...,Xn=p). 

2: Denote the subtree of T, consisting of root Xm and all its descendants, by 7^ for 
m = 1 n. 



3 
4 
5 
6 
7 
8 
9 

10 
11 



for m = 1, . . . , n do 
for i = 1, . . . , \Vr\ do 

ar^A^) = 
end for 
end for 

«ri,,;(l) = 1 

for m = 2, . . . , n do 

Label the subtree 7^ \ 7^„i by T^_i {cf. Figure 
M = min{|VV,_J- 1,^-2} 



12: for i = m, . . . , \ Vr^\ do 

M 



^T_J + |l^r4_J-l-(^-l)\ A -2^ 



/ 1 

13: ar^A'^) ■= Yl ^Tm^iA'^ - J - 1) ( 



14 
15 
16 
17 



end for 
end for 

for i = 1,...,|VV| do 

,1 _ aT„,i,(j) 



\r(v] 



18: end for 

19: RETURN P[r(t;) = z], z = 1, . . . , 



Theorem 4.1.12. RankProbGen returns the probabilities 

F[r{v) = i] 

for each given v G V and all i (E 1, . . . ,\V\. The runtime is 0(|yp). 

Proof. The algorithm is the same as RankProb. The only difference is that in each 
step, we define T^_i := Tm\ 7^-i, i.e. the root of 7^ is x^. For any rank function 
on 7^, we now insert the first j elements (excluding the root Xm) before the vertex 
V. The number of ways to insert these vertices is counted analogously to the proof of 
Theorem ()4.1.5|) . The number of possible rank functions on 7^ does not have to be 
calculated, since these factors cancel out when calculating the probabilities. 
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Figure 4.5: Labelling the tree for algorithm RankProbGen. 




Figure 4.6: What is the probability that vertex u has smaller rank than vertex f ? 



Since we do the same iterations as in RankProb, the algorithm RankProbGen 
has quadratic runtime as well. □ 

4.2 Comparing two interior vertices 

Assume again that every rank function on a binary phylogenetic tree T is equally 
likely. We want to compare two interior vertices u and v of T. Was u more likely 
before f or f before u {cf. Fig. 14. 6|) ? In other words, we want to know the probability 

P„<„ ■.= F[r{u) < r{v)\T] 

where r{T) is the set of all possible rank functions on T. This probability is, by 
Theorem ()2.2.4j) . equivalent to counting all the possible rank functions on T in 
which u has lower rank than v and divide that number by all possible rank functions 
on T. The algorithm Compare will solve this problem in quadratic time. 
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Algorithm COMPARE (T, u, v) 

Input: A rooted phylogenetic tree T and two distinct interior vertices u and v. 
Output: The probabihty P„<„ := F[r{u) < r{v)\T\. 
1: Denote the most recent common ancestor of u and ^; by pi. 
if pi = V then 



RETURN P„<^ = 0. 



end if 

if pi — u then 

RETURN P„<^ = 1. 
end if 

Let 7^^ be the subtree of T which is induced by pi. 
9: Delete the vertex pi from 7^^. The two evolving subtrees are labeled 7^ and % 

with u & and v & %. 
10: Run RANKPROB(7J,,ii) and RankProb(7J,, to get ¥[r{u) ^ i] on % and 
P[r(f ) = i] on 7^ for all possible i. 
for i = 1, . . . , \ VrJ do 

ucum{i) := Ylk=i = i] 



11 
12 
13 
14 

15 
16 

17 
18 
19 
20 
21 

22 
23 



end for 

^u<v '■— 

for i = 1, . . . , \ VrJ do 
for j = l,...|1/rj do 

p := F[riv) = t] ■ ('-^+') ■ ('^"""'i:^'^'!^"'"') • ucum{j) (*) 

end for 
end for 

RETURN P„<„ 



Theorem 4.2.1. T/ie algorithm Compare returns the value 

F«<^ = P[rH <r{v)\T]. 

Proof. Note that the probability of u having smaller rank than v in tree 7^^ equals 
the probability of u having smaller rank than v in tree T, since for any rank function 
on 7^j , there is the same number of linear extensions to get a rank function on the 
tree T. 

So it is sufficient to calculate the probability ^u<v in ■ If pi — u, u is before v 
in T and we return Pu<t, = 1. If pi = v, v is before u in T and we return Pu<i. = 0. 

In the following, let pi ^ u, pi ^ v. The run of RankProb gives us the probability 
F[r{u) = i] in the tree 7^ and P[r(w) = i] in 7^ for all i. We want to combine these 
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two linear orders. Assume that r{v) = i and we insert j vertices of 7^ before v. 
Inserting j vertices of 7^ into the hnear order of % before v is possible in 
ways (see Corollarv I4.1.3|) . Putting the remaining vertices in a linear order is possible 
in "') ways. The probability that the vertex u is among the j vertices 

which have smaller rank than v is P[r(M) < j] = ucum{j). There are |r(7^)| possible 
linear orders on 7^ and |r(7^)| possible linear orders on %. The number of linear 
orders where vertex v has rank i in %, v has rank i + j in 7^^ and r{u) < i + j 
therefore equals 

Pi,j = ^r{v) = i] ■ \r{%)\ ■ ^ j ^'^') ■ ('^^"' J 1^"' ■ ■ I'^i'^n)] 

Adding up the p' for each i and j gives us the number of linear orders where u is 
earlier than v. 

Combining a linear order on % with a linear order on 7^ is possible in 

V \Vt.\ 

different ways (see Corollarv I4.1.3p . There are |r(7^)| linear orders on 7^ and \r{%] 
linear orders on 7^, so on 7^^, we have 



tot' := ('^''|^[^''')|r(T.)||r(T.) 



linear orders. Therefore we get 

_ Yli,jPi,j _ YlijPiJ 



tot' tot 

with pij = F[r{v) = i] ■ (*"-^.+-') ■ ■ucum{j). This shows that Compare 

works correct. □ 

Theorem 4.2.2. The runtime o/ COMPARE is 0{\V\^). 

Proof. Again, note that the combinatorial factors (^) for all n, < |V^| can be 
calculated in advance in quadratic time, see Remark ()4.1.4|) . In the algorithm, those 
factors can then be obtained in constant time. 

Contributions to the runtime from each line in COMPARE (the runtime is always 
w.r.t. \V\): 
Line 1: linear time 
Line 2-1: constant time 
Line 8: linear time 
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P 




^~ 

Figure 4.7: Example for COMPARE: Calculate the probability of m < f in the displayed 
tree T. 

Line 9: constant time 

Line 10: quadratic time, since RankProb has quadratic runtime 
Line 11-13: linear time 
Line 14: constant time 

Line 15-20: quadratic time since (*) has to be evaluated |Vr„| ■ |Vr„| < l^rP times 
Line 21-23: constant time 

Therefore, the overall runtime of Compare is 0(|V^p). □ 

Example 4.2.3. Fig. 14.71 displavs the tree T. We want to calculate the probability 
P„<„, i.e. the probability of vertex u having a smaller rank than vertex v. 

A run of the Python code attached in Appendix IHl with input (T, m, v) returns 
F - A 

4.3 Application of RankProb - Estimating edge 
lengths in a Yule tree 

In pni; cL primate supertree on 218 species was constructed with the MRP method 
(Matrix Representation using Parsimony analysis, see |2 1^)- The resulting 
supertree is shown in Appendix O This tree has only 210 interior vertices. There 
are six 'soft' polytomies in the supertree, i.e. six vertices have more than two direct 
descendants because the exact resolution is unclear (i.e. the supertree is non-binary). 

Since for most of the interior vertices, no molecular estimates were available, the 
edge lengths for the tree were estimated. Here, the length of an edge represents the 
time between two speciation events. 

A very common stochastic model for trees with edge lengths is the continuous- 
time Yule model. As in the discrete-time Yule model, at every point in time, each 
species is equally likely to split and give birth to two new species. The expected 
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Figure 4.8: Labeling the tree for estimating the edge lengths. 



waiting time for the next speciation event in a tree with n leaves is 1/n. That is, each 
species at any given time has a constant speciation rate (normalized so that 1 is the 
expected time until it next speciates). 

It was assumed that the primate tree Tp evolved under the continuous-time Yule 
model. In [l^, 10^ rank functions on Tp were drawn uniformly at random. For each 
of those rank functions, the expected time intervals, i.e. the edge lengths, between 
vertices were considered (the expected waiting time after the (n — l)th event until 
the nth event is 1/n). 

The authors of |^ concluded their paper by asking for an analytical approach to 
the estimation of the edge length, and we provide this now. 

4.3.1 Analytical estimation of the edge length 

Let {u, v) be an interior edge in T with u <r v. Let X be the random variable 'length 
of the edge (m, f)' given that T is generated according to the continuous-time Yule 



Since under the continuous-time Yule model, the expected waiting time for the next 
event is 1/n, we have 



It remains to calculate the probability P[r('u) = i, r{v) = j]. We count all the possible 
rank functions where r{u) = i and r{v) = j. The subtree % consists of v and all its 
descendants. The tree 7^ evolves from T when we replace the subtree % hj a. leaf, 
see Fig. Ol 



model. 



The expected length E[X] of the edge {u,v) is given by 



E[X] = E[X\r{u) = i, r{v) = j]F[r{u) = i, r{v) = j]. 
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Note that P[r(M) = i,r{v) = j] = if \ VrJ < j — Therefore, assume \ Vr^ \ > j — l 
in the following. 

The number of rank functions in 7^ is denoted by i?r„- The probability P[r(M) = i] 
can be calculated with RankProb(7^, u). So the number of rank functions in 7^ 
with F[r{u) = i] is F[r{u) = i] ■ Rr^. 

The number of rank functions in 7^ is denoted by i?r„- Let any linear order on 
the tree 7^ and 7^ be given. Combining those two linear orders to an order on T, 
where r{v) = j holds, means, that the vertices with rank 1, 2, ... ,j — 1 in 7^ keep 
their rank. Vertex v gets rank j. The remaining \Vt:^ \ — {j — 1) vertices in 7^ and 
\ — 1 vertices in % have to be shuffled together. According to Corollary (j4.1.3p . 
this can be done in 

VrJ-(j-l) + |VrJ-l\ ^ (\VrJ + \VrJ-J 

ivvj-1 ; V lVVj-1 



different ways. Overall, we have 

F[riu) = z] ■ Rr^ ■ Rr^ ■ 



IVVJ-1 



different rank functions on T with r{u) = i and r{v) = j. For the probability P[r(M) 
i, r(t>) = j], we get 

¥[r{u) = i,r{v)=j]- ' 



E.,nriu)=^]■Rr.■Rr.■(''^]l:;^J:^^-') 
Since i?r„ and R% are independent of i and j, those factors cancel out, and we get 

F[r{u) = t] ■ CH^'f^^'^O 

F[r{u) =i,r{v) =j]- 



Further, we note that 

VrJ + |VrJ- A (|Vr| -j)! 



\VrA - 1 ; i\VrJ - l)!(|^r| - J - {\VrJ - 1))! 

Again, since (|Vr„| — 1)' is independent of i and j, this factor cancels out, and we are 
left with 

™r . ^ • / N -1 nriu)=t]-Uf-o'\\Vr\-J-k) 



E„-PK«)=^]-ni=s'"'(|Vr|-J-A:) 
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Let Q = : i < G {1, • • • , |^|}, \VrJ > j - 1}. With that notation, the 

expected edge length K[X] is 

E[X] = E[X\r{u) = i, r{v) = j]F[r{u) = i, r{v) = j] 



E 



,fe=i 



i + k 



p[^H = ^]-nS'"'(ivv 


-j 


-k) 






\Vt 


-J-k) 











Y.{i,j)en 


p[rH = z]-n£s'''(ivr 


- J - 


k)\ 





(4.3) 



Remark 4.3.1. With Equation ()4.3p . we can estimate the length of all the interior 
edges. For the pendant edges, the approach above gives us no estimate though. All 
we know is that the time from the latest interior vertex, which has rank n — 1, until 
the presence is expected to be at most 1/n where n is the number of leaves. 

Remark 4.3.2. In a supertree, we can have interior vertices which are not fully 
resolved, i.e. an interior vertex can have more than two descendants, because the 
exact resolution is unclear. Our calculation for the expected edge length assumes a 
binary tree though. 

However, we can calculate the expected edge length for each possible binary 
resolution of the supertree. Assume the supertree T has the possible binary 
resolutions Tl,...,'7^. For an edge {u,v) in T where u <r v, the expected edge 
length is calculated in the trees 7^ for i = 1, . . . , m. The expected edge length in % 
is denoted by Cj for z = 1, . . . , m. 

We calculate the expected edge length K[X] of {u, v) in the supertree T by 

E[X] = ^k^^^ (4.4) 

where the probability is calculated according to Corollary (j2.2.5p . 

Note that if m is a vertex with more than two descendants in T, v is in general 
not a direct descendant of u in %. The value Cj in resolution % is then the sum of all 
expected edge lengths on the path from m to f in %. 

Remark 4.3.3. In the primate supertree in AppendixO there are six interior vertices 
with more than two descendants (vertex labels 48, 63, 148, 153, 157 and 200). For the 
vertices labeled with 63 and 200, only one resolution is possible (up to the labeling). 

The interior vertices with label 48, 153 and 157 have three descendants each. 
So there are 3^ possible binary resolutions. The interior vertex 148 has four leaf- 
descendants. There are two possible binary resolutions (up to the labeling). To 
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calculate the expected edge lengths for the primate supertree, we therefore have 
to calculate the expected edge lengths on 3'^ • 2 binary trees and then calculate the 
weighted sum from Equation ()4.4|) . 



Chapter 5 
Speciation Rates 



This chapter was motivated by Craig Moritz and Andrew Hugall, biologists from 
Berkeley and Adelaide. They looked at a tree showing the relationships between a 
set of snails. Each of those snails lives either in rain forest or open forest. The tree has 
edge lengths assigned. Moritz and Hugall asked if the rate of speciation is different 
for rain forest snails and open forest snails. 

Mathematically, determining the rate of speciation is the following problem. The 
leaves arc divided into two classes, a and 3 (e.g. rain forest and open forest snails). 
Given the rate that a species belonging to class a changes to a species belonging to 
class /9 (and vice versa), we calculate the expected length of an edge between two 
species of group a (resp. /?). This expected length is an estimate for the inverse of 
the rate of speciation and is calculated in linear time. 

5.1 Some notation 

Definition 5.1.1. Let X' be a non-empty subset of X. Let C be a non-empty set. 
A character on X is a function % : X' — > C. C is the character state set of x- If 
X' — X , we say x is a full character. If |C| = 2, we say x is a binary character. 

Definition 5.1.2. Let T he a rooted phylogenetic X-tree with vertex set V and 
leaf set L G V. Let x be a full binary character on T, x • ^ ~^ Define 
s : V —>■ {a, (3} with s\l = x°4>~^- {'^i^) is called a. phylogenetic state tree., s a state 
function. 

In the following, the phylogenetic state tree (T, s) shall have assigned a function 
I : E ^ M"*". / shall denote the edge lengths of T. Let 77 G {a,/?} throughout this 
chapter. Let v be any node in (T, s) with s{v) = rj. We then say that the state 
of V is rj. Let 7 G {ck, (3} x {a,P} throughout the chapter, i.e. 7 = (71,72) with 
7i,72 e {a,P}. An edge e = {vi,V2) of {T,s) where vi <r V2 and s{vi) — 71, 
s{v2) = 72 is called a 7 — edge. 
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a 




app(3a(3aaap (3 a(3(3(3al3aaa(3 (3 



Figure 5.1: A phylogenetic tree with a full character on the left and a phylogenetic 
state tree on the right (without the leaf labels) . 




Figure 5.2: With s{vi) — 71 and s{v2) — 72, the edge e = (vi, V2) is a 7-edge. 
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Figure 5.3: Rate of the state change for a binary character 



5.2 Markov Chain Model 

Throughout evolution, assume that state a changes to state (3 with rate and state 
P changes to state a with rate r^j, so the rates only depend upon the state of the 
last vertex (see Fig. 15. 3|) . This means that the state change follows a Markov Chain 
model, and for that model, we want to calculate the transision matrix 

(/(e)) PaMe)) 

where ^7172(^(6)) = P [(s(f 2) = 72)|(s(fi) = 71)] with e = {vi,V2) and vi <r V2. 
The rate matrix R is defined as 



R 



Diagonalization of R yields 



rp -rp) [0 -(ra + rp) 



with 



S 



1 To 

From stochastic processes, we know that the connection between the rate matrix and 
the transition matrix is 

P'(/(e)) = RPilie)) 
Solving this differential equation yields 



P(/(e)) = P(0)e^('(")) 
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with P(0) = Id since /(e) = means staying in the vertex. Therefore P(/(e)) can be 
rewritten as 

P(/(e)) - e^^^^'^)' 

- ^^^P<o -(J+.,))^(^)>^" 

- s(^ ° ^5-^ 



The initial probabihty of vertex v being in state r] shall be tt,,, G {a, It holds 



so 



Therefore, for any given phylogenetic tree T with edge lengths /(e), the probability 
of its vertices being in states according to a state function s is 



(5.1) 



e&E 

V1<TV2 



Furthermore, it holds for any e & E with e — {vi, V2) 



^s(tii) 



Ps{v2),s{vi)iK^)) 



(5.2) 



5.3 Expected length of a 7-edge 

Given a phylogenetic tree T with character %, edge length /(e) and rate matrix R, we 
want to calculate the expected average length of a 7-edge over all (T, s). The inverse 
of this length is an estimate for the rate of speciation. 

Calculating the expected average length of a 7-edge over all (T, s) means 
calculating 

e£E, e J— edge 



# of J- edges 
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where denotes the expected value over all s given sl^, = x- Trying to calculate 
this expected value turns out to give us very nasty recursion formulas. 
So we change the problem slightly and try to calculate instead 



Define the random variable 



X,(e) := 



With that, we get 



.eeE, e 7— edge 



[#of -f- edges] 



1 if e is 7-edge 
else 



Ex E ^(^) 

.eeE, e -y—edge 

Ex [4f of -f- edges] 



E, 



.eeE 



E, 



.eeE 



5^/(e)P[(X,(e) = l)|x] 



eeE 



(5.3) 



eeE 



where P [(X^(e) = l)|x] denotes the probability of e being a 7-edge given s\l = X- So 
it is basically left to calculate P [(X^(e) = To do so, we first define two subtrees 

of T (see also Fig. 15. 4|) . Denote the end vertices of e by pi and p2 with pi <r P2- 
By deleting the 7-edge e in T, we get two new trees Ti and 7^, 7^ with pi G 7^ and 
character xi = xI^-M^Ti)' ^ with P2 G 7^ and character X2 = ^^-^(Lra) where 
Lr^ denotes the set of leaves of 7^, z G {1, 2}. The root in % shall be pi, so p becomes 
an ordinary vertex in 7i. 

^ IXili^iPi) = li)] shall denote the probability of the character Xi on the tree % 
given s{pi) = 7^. P [xr\r2l(s(pi) = 71)] shall denote the probability of the character 
Xr\T2 on the tree T \ 7^ given s(pi) = 71. f[xT, s] shall denote the probability of the 
character x and the state function s on the tree T. We denote the vertices on the 
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path from pi to p hj pi = Xi, X2, ■ ■ ■ , Xn-i, Xn = p- With ()5.1|) and ()5.2|) . it holds 

■p-rn— 1 



This yields 



nn—l 
, . i=l Ps(xi),s(xi+i) ™ r -1 



P[Xi|(s(pi) = 7i)] = 5^ 

s:s(pi)=7i 
s:s(pi)=7i 

= P [xr\rj (s(pi)=7i)] 



With that result, we get 
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T' T" 

Figure 5.5: Calculating the expected edge length: Defining T 

P[(X,(e) = l)|x] = 

_ P[(X,(e) = l)]P[x|(X,(e) = l)] 

7r,,p,,,,(/(e))P [xr\r,\{s{p,) = 7i)] P [X2|(^(P2) = 72)] 
7r,,P,,,,(/(e))P[xrJ(s(pi) =7i)]P[XrJ(s(p2) =72)] 

7= (71 .72) 

7r^,p^,^,(/(e))P[xi|(3(pi) =7i)]P[X2|(g(p2) =72)] 

Y ^7iP7i72(Ke))P[xrJ(s(pi) =7i)]P[xrJ(s(p2) =72)] 
7= (71 .72) 



(5.4) 



¥[xi\{s{pi) = 7j)] is calculated in a recursive way, starting from the bottom of the 
tree. 

Suppose we have the subtree T as in Fig. 15.51 and either ri , r2 are leaves or we 
know P [xr'|("S(r'i) = rj)] on tree T', P [xr"|(s(r2) = rj)] on tree T", for i] G {«,/?}. 
With that, we get the following recursive formulas for the probabilities on tree T. 

• For ri and r2 leaves: 



??ii^2e{o,/3} 



For ri leave, r2 interior node: 



Y ^ [^^' I ^^^^^^ = ^1)] Pvx(r2)Pmi 
P [XrK^W = ^)] = 

'?ii»?2e{a,/3} 
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• For ri and r2 interior nodes: 

r[xf\{s{r) = r])] = 

r;i:'?2 6{Q!,/3} 

Algorithm EdgeLength (T, x) 

Input: A rooted binary phylogenetic tree T and a character % on T with state change 
rates Tq, and r^j 

Output: The values \E'^ for 7 G {a,/?} x (c/. Equation ()5.3|) ) 

• Define the subtrees 7^ and 7^ of T as described above. 

• Calculate P[xril(s(Pi) = 7j)] for ^ ^ {1)2}, j G {1,2}, with the recursive 
formulas from above. 

• Evaluate P [{X^{e) = l)\x] according to ()5.4p for all 7 G {a,f3} x {a;,/9}. 

• Evaluate \E'^ according to ()5.3|) for all 7 G {a,/?} x {a,P}. 
Theorem 5.3.1. EdgeLength works correct, i.e. it returns 



Ex 



.edE, e -y—edge 



[if of -f- edges] 
The complexity is 0{\V\), so it is linear. 

Proof. The correctness of the algorithm follows from the construction above. It is left 
to verify the runtime. 

Calculating the probabilities F [{xTi\{s{pi) = 7j)] for i G {1,2}, j G {1,2} with 
the recursive formulas requires 0(|y|) calculations since we have to evaluate one 
recursion formula for each vertex. For each edge e, P[(A^(e) = l)|x] can then be 
calculated according to ()5.4|) with a constant number of calculations. So obtaining 
P[(A^(e) = l)|x] for aU e requires 0{\E\) = 0{\V\) calculations. Calculating 
according to (|5.3|) requires again 0(|ii^|) calculations. Therefore, the complexity is 
linear. □ 



Outlook 



There are several topics in the thesis which suggest further work. 

In Chapter IHl we conclude with the log-likelihood-ratio test for deciding if a tree 
evolved under Yule. The given bound for the power of the test, Equation ()H.4|1 . 
depends on the bound for the Azuma inequality. The bound Inn for the Azuma 
inequality was obtained in 13.2.11 by a lot of rough estimations. So we are very 
confident that there can be found a better bound clnn, with c « 1 being a constant. 
This would lead to an improved bound for the power of the log-likelihood-ratio test 
(i.e. one could show analytically that the log-likelihood-ratio test is very good even 
on trees with a small number of leaves). 

The edge lengths estimation in Section 14.31 will be implemented by Rutger Vos 
in Perl for his library and in Java for Mesquite (Mesquite is a tree manipulation 
software suite). Once implemented, the algorithm can finally be applied to real data. 
One can then estimate the edge lengths of a constructed supertree. 

Section El provides an algorithm for calculating and "^13^13 which estimate 
the average edge lengths. Let be the speciation rate for species of class a and 
let ip/3 be the speciation rate for species of class /5. One could test the hypothesis 
= 4'i3,i3 against 4'a,a 7^ 4'f3,i3 with the statistic For evaluating this test, i.e. 
obtaining the Type I and Type II error, one can use simulations. 

Further, in Sectional we assumed that the transition rates and rp are given. 
An interesting open question is how to handle the problem without having these 
transition rates in advance. 
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Appendix A 
List of Symbols 



Symbol Meaning page 

<x partial order on the vertices of a tree T |B] 

<T partial order on the vertices of T |B] 

(2n-l)!! (2n- 1) X (2n-3)...3 X 1 GUI 

(T, s) phylogenetic state tree [3 

(T, r) ranked phylogenetic tree T with rank function r 13 

ar^vii) \{r ^{v) =i,r ^r{T)}\ US) 

X character on a phylogenetic tree |^ 

5{v) degree of vertex v E] 

\ number of elements of V that are descendants of f 13 

TT initial probability distribution of Markov chain IHU 

p root of a tree E] 

labelling function of a phylogenetic tree T |B] 

\E'^ estimated length of a 7-edge EH 

T phylogenetic X-tree E] 

7^ Primate supertree constructed in jTHI EHl 

% phylogenetic subtree of T induced by vertex v |3 

Tx' phylogenetic subtree of T with label set X' E] 

Jp Entropy of the probabihty distribution p IT^ 

P„<^, Probability P[r(M) < r(w) IT] EHl 

Pc/ Uniform distribution on RB{X) CH 

P{/[T] Probability of T under the uniform model ^1 

Py Yule distribution on RB{X) 

Py[T] Probability of T under the Yule model 
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Catalan number 


c 


set of character states 


d{v) 


number of direct descendants of vertex v 


dKL{p, q) 


Kullbach-Liebler distance between p and q 


E,Er 


Edges of a phylogenetic tree T 


/(e) 


Length of edge e in T 




Leaf set of a (phylogenetic) tree 




probability of state change from 71 to 72 


P(/(e)) 


transition matrix of Markov chain, 




dependent on edge length 




rate of change from state a to [3 {(3 to a) 


r, rr 


rank function of phylogenetic tree T 


r(r) 


Set of rank functions on T 


rRB{n) 


Set of ranked binary phylogenetic X-trees 




with X = {1,2, ...n} 


rRB{X) 


Set of ranked binary phylogenetic X-trees 


R 


rate matrix of a Markov chain 


RB{n) 


Set of binary phylogenetic X-trees 




with X = {1,2, ...n} 


RB{X) 


Set of binary phylogenetic X-trees 


s 


state function 


V,Vt 


Set of vertices of a (phylogenetic) tree 




Set of interior vertices of a (phylogenetic) tree 



Appendix B 

Algorithms coded in Python 



# Rank functions 

# Daniel Ford, Tanja Gernhard 2006 
# 

# Functions: 
# 

# rankprob(t ,u) - returns the probability distribution 

# of the rank of vertex "u" in tree "t" 

# expectedrank(t ,u) returns the expected rank 

# of vertex "u" and the variance 

# compare (t,u,v) - returns the probability that "u" 

# is below "v" in tree "t" 



import random 

# How we store the trees: 

# The interior vertices of a tree with n leaves are 

# labeled by 1. . .n-1 

# Example input tree for all the algorithms below: 

# The tree "t" below has n=9 leaves and the inner nodes have 

# label 1. . .8 

tl = (((), 0, {'leaves.below' : 2, 'label^: 4}), () , 

{'leaves.below' : 3, 'label': 3}) 
t2 = (((), 0, {'leaves.below' : 2, 'label': 7}), ((), () , 

{'leaves_below' : 2, 'label': 8}), 

{'leaves.below' : 4, 'label': 6}) 
t3 = ((), 0, {'leaves.below' : 2, 'label': 5}) 
t4 = (tl,t3,{'leaves_below' : 5, 'label': 2}) 
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t = (t2,t4,{'leaves_below' : 9, 'label': 1}) 

# Calculation of n choose j 

# This version saves partial results for use later 
nc_matrix = [] #stores the values of nchoose(n,j) 

# — note: order of indices is reversed 
def nchoose_static(n, j ,nc_matrix) : 

if j>n: 

return 
if len(nc_matrix) <j + l : 

for i in range (len(nc_matrix) : 
nc_matrix += [[]] 
if len(nc_matrix[j] )<n+l: 

for i in range(len(nc_matrix[j] ) , j) : 

nc_matrix [ j ] += [0] 
if len(nc_matrix[j] )==j : 

nc_matrix [ j ] += [1] 
for i in range (len(nc_matrix [j] ) ,n+l) : 

nc_matrix [j] += [nc_matrix [j] [i-1] *i/ (i-j)] 
return nc_matrix[j] [n] 

# dynamic programming verion 
def nchooseCn, j) : 

return nchoose_static(n, j ,nc_matrix) 
#nc_matrix acts as a static variable 



# get the number of descendants of u and of all vertices on the 

# path to the root (subroutine for rankprob(t,u)) 
def numDescendants(t,u) : 

if t == : 

return [False, False] 
if t[2] ["label"] ==u: 

return [True, [t [2] ["leaves.below"] -1] ] 
X = numDescendants (t [0] ,u) 
if X [0] == True : 

if t[l]==(): 
n = 

else : 

n = t[l] [2] ["leaves_below"]-l 
return [True , x [ 1 ] + [n] ] 
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y = numDescendants (t [1] ,u) 
if y[0] == True: 

if t[0]==(): 
n = 

else : 

n = t[0] [2] ["leaves_below"]-l 
return [True , y [ 1 ] + [n] ] 
else : 

return [False, False] 



# A version of rankprob which uses the function numDescendants 
def rankprobCt ,u) : 

X = numDescendants (t,u) 

X = x[l] 

Ihsm = x[0] 

k = len(x) 

start = 1 

end = 1 

rp = [0,1] 

step = 1 

while step < k: 

rhsm = x[step] 

newstart = start+1 

newend = end+rhsm+1 

rp2 = [] 

for i in range (0, newend+1 ) : 
rp2+= [0] 

for i in range (newstart , newend+1 ) : 

q = max(0 , i-l-end) 

for j in range(q,min(rhsm, i-2)+l) : 

a = rp[i-j-l] *nchoose(lhsm + rhsm - (i-l),rhsm- 
*nchoose(i-2, j) 

rp2 [i] +=a 

rp = rp2 

start = newstart 

end = newend 

Ihsm = lhsm+rhsm+1 

step+=l 
tot = float (sum(rp) ) 
for i in range (0, len (rp) ) : 

rp[i] = rp[i]/tot 
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return rp 

# For tree "t" and vertex "u" calculate the 

# expected rank and variance 
def expectedrankCt ,u) : 

rp = rankprobCt ,u) 
mu = 
sigma = 

for i in range (0, len (rp) ) : 

mu += i*rp[i] 

sigma += i*i*rp[i] 
return (mu , sigma-mu*mu) 

# GCD - assumes positive integers as input 

# (subroutine for compare (t,u,v)) 
def gcd(n,m) : 

if n==m: 

return n 
if m>n: 

[n,m] = [m,n] 
i = n/m 
n = n-m*i 
if n==0: 

return m 
return gcd(m,n) 

# Takes two large integers and attempts to divide them and give 

# the float answer without overflowing 

# (subroutine for compare (t,u,v)) 

# does this by first taking out the gcd 
def gcd_divide(n,m) : 

X = gcd(n,m) 
n = n/x 
m = m/x 

return n/float(m) 

# returns the subtree rooted at the common ancestor of u and v 

# (subroutine for compare (t,u,v)) 
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# return 

# True/False - have we found u yet 

# True/False - have we found v yet 

# the subtree - if we have found u and v 

# the u half of the subtree 

# the V half of the subtree 
def subtree (t, u, v) : 

if t == : 

return [False , False , False , False , False] 
[a,b, c,xl ,x2] =subtree (t [0] ,u,v) 
[d, e,f ,yl ,y2] =subtree(t [1] ,u,v) 
if (a and b) : 

return [a,b,c,xl,x2] 
if (d and e) : 

return [d,e,f ,yl,y2] 

# 

X = (a or d or t [2] ["label"] ==u) 
y = (b or e or t [2] ["label"] ==v) 
# 

tl = False 
t2 = False 
# 

if a: 

tl = xl 
if b: 

t2 = x2 
if d: 

tl = yl 
if e: 

t2 = y2 

# 

if X and (not y) : 

tl = t 
elif y and (not x) : 
t2 = t 

# 

if t[2] ["label"]==u: 

tl = t 
if t[2] ["label"] ==v: 

t2 = t 
return [x,y ,t ,tl ,t2] 
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# Gives the probability that vertex labeled v is 

# below vertex labeled u 
def compare (t ,u,v) : 

[a,b,c,d,e] = subtree (t ,u,v) 
if not (a and b) : 

print "This tree does not have those vertices!" 

return 
if (c[2] ["label"] ==u) : 

return 1 . 
if (c[2] ["label"] ==v) : 

return 0.0 
tu = d 
tv = e 

usize = d [2] ["leaves_below"] -1 

vsize = e [2] ["leaves_below"] -1 

X = rankprob(tu,u) 

y = rankprob(tv,v) 

for i in range (len(x) ,usize+2) : 

x+= [0] 
xcumulative = [0] 
for i in range ( 1, len (x) ) : 

xcumulative+= [xcumulative [i-1] +x [i] ] 
rp = [0] 

for i in range (1, len (y) ) : 
rp+= [0] 

for j in range (1, usize+1 ) : 

a = y [i] *nchoose(i-l+j , j)*nchoose(vsize-i+usize 
usize-j)*xcumulative [j] 

rp [i] +=a 
tot = nchoose(usize+vsize, vsize) 
return sum(rp)/f loat(tot) 



Appendix C 
Primate Supertree 




MYA 



6£ eO SE SO 4E «i 3£ 30 25 20 IS 10 £ 



78 



APPENDIX C. PRIMATE SUPERTREE 




Pctph kdsKddTyd!: 
124 

Mctcctcct m^ctttct 



T 1 1 1 1 1 1 1 1 1 UYA 

E 



Figure C.l: Primate Supertree - Figure 
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Figure C.2: Primate Supertree - Figure 
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Figure C.3: Primate Supertree - Figure 
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Figure C.4: Primate Supertree - Figure 
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Figure C.5: Primate Supertree - Figure 
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Figure C.6: Primate Supertree - Figure 



APPENDIX C. PRIMATE SUPERTREE 




Prtkarrd mDStdcku^ 
Pr^kxrd dsqiidi£?rjdl js 
Prtkarrd jrTmtiid 
Pjikscjd pjiksckt 
Cdcdjdi? cdlviis 

Ck TTSfKPt^s dlbrKdsiis 
Ck f Jiiyra!-!? J !:dtd}td^ 



I I I I I I I I I I I I I I I I I I I I I MYA 
20 IE 10 S 



Figure C.7: Primate Supertree - Figure 9 
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Figure C.8: Primate Supertree - Figure 10 
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Figure C.9: Primate Supertree - Figure 11 
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Figure C.IO: Primate Supertree - Figure 
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Figure C.ll: Primate Supertree - Figure 
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